Method and system for managing computer systems

ABSTRACT

A management system for a computer system is disclosed. The computer system operates or includes various products (e.g., software products) that can be managed in a management system or collectively by a group of management systems. Typically, the management system operates on a computer separate from the computer system being managed. The management system can make use of a knowledge base of causing symptoms for previously observed problems at other sites or computer systems. In other words, the knowledge base can be built from and shared by different users across different products to leverage knowledge that is otherwise disparate. The knowledge base typically grows over time. The management system can use its ability to request information from the computer system being managed together with the knowledge base to infer a problem root cause in the computer system being managed. The computer system being managed can also request the management system to process its knowledge base for possible problem cause analysis. The management system can also continually identify persisting problem-causing symptoms.

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of: (i) U.S. Provisional Patent Application No. 60/371,659, filed Apr. 10, 2002, and entitled “METHOD AND SYSTEM FOR MANAGING COMPUTER SYSTEMS,” which is hereby incorporated by reference herein; and (ii) U.S. Provisional Patent Application No. 60/431,551, filed Dec. 5, 2002, and entitled “METHOD AND SYSTEM FOR MANAGING COMPUTER SYSTEMS,” which is hereby incorporated by reference herein.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] The present invention relates to computer systems and, more particularly, to management of computer systems.

[0004] 2. Description of the Related Art

[0005] Today's computer systems, namely enterprise computer systems, make use of a wide range of products. The products are often applications, such as operating systems, application servers, database servers, JAVA Virtual Machines, etc. These computer systems often suffer from network and system-related problems. Unfortunately, given the complex mixture of products concurrently used by such computer systems, there is great difficulty in identifying and isolating application-related problems. Typically, when a problem occurs on a computer system, it must first be isolated to a particular computer system out of many different computer systems or to the network interconnect among these systems and also to a particular application out of many different applications used by the computer system. However, conventionally speaking, isolating the problem is difficult, time consuming and requires a team of application experts with different domain expertise. These experts are expensive, and the resulting down time of computer systems is very expensive to enterprises.

[0006] Although management solutions have been developed, such solutions are dedicated to particular customers and/or specific products. Monitoring systems are able to provide monitoring for events, but offer no meaningful management of non-catastrophic problems and prevention of catastrophic problems. Hence, conventional managing and monitoring solutions are dedicated approaches that are not generally usable across different computer systems using combinations of products.

[0007] Thus, there is a need for improved management systems that are able to efficiently manage computer systems over a wide range of products.

SUMMARY OF THE INVENTION

[0008] Broadly speaking, the invention relates to a management system for a computer system. The computer system operates or includes various products (e.g., software products) that can be managed in a management system or collectively by a group of management systems. Typically, the management system operates on a computer separate from the computer system being managed. The management system can make use of a knowledge base of causing symptoms for previously observed problems at other sites or computer systems. In other words, the knowledge base can be built from and shared by different users across different products to leverage knowledge that is otherwise disparate. The knowledge base typically grows over time. The management system can use its ability to request information from the computer system being managed together with the knowledge base to infer a problem root cause in the computer system being managed. The computer system being managed can also request the management system to process its knowledge base for possible problem cause analysis. The management system can also continually identify persisting problem-causing symptoms.

[0009] The invention can be implemented in numerous ways, including as a method, system, apparatus, and computer readable medium. Several embodiments of the invention are discussed below.

[0010] As a management system for a computer system, one embodiment of the invention includes at least: a plurality of agents residing within managed nodes of a plurality of different products used within the computer system, and a manager for said management system. The manager is operable across the different products.

[0011] As a method for managing an enterprise computer system, one embodiment of the invention includes at least the acts of: receiving a fact pertaining to a condition of one of a plurality of different products that are operating in the enterprise computer system; asserting the fact with respect to an inference engine, the inference engine using rules based on facts; retrieving updated facts from the inference engine from those of the rules that are dependent on the fact that has been asserted; and performing an action in view of the updated facts.

[0012] As a method for isolating a root cause of a software problem in an enterprise computer system supporting a plurality of software products, one embodiment of the invention includes at least the acts of: forming a knowledge base from causing symptoms and experienced problems provided by a disparate group of contributors; and examining the knowledge base with respect to the software problem to isolate the cause of the software problem to one of the software products.

[0013] Other aspects and advantages of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] The present invention will be readily understood by the following detailed description in conjunction with the accompanying drawings, wherein like reference numerals designate like structural elements, and in which:

[0015] FIG. 1 is a block diagram of a management system according to one embodiment of the invention.

[0016] FIG. 2 is a block diagram of a manager for a management system according to one embodiment of the invention.

[0017] FIG. 3 is a block diagram of a GUI (Graphical User Interface) according to one embodiment of the invention.

[0018] FIG. 4 is a block diagram of a knowledge manager according to one embodiment of the invention.

[0019] FIG. 5A is a diagram of a directed graph representing a knowledge base.

[0020] FIG. 5B represents a small portion of knowledge provided in a segment of a directed graph (e.g., the directed graph of FIG. 5A).

[0021] FIG. 5C represents a small portion of knowledge provided in a segment of a directed graph (e.g., the directed graph of FIG. 5A).

[0022] FIG. 6 is a block diagram of a knowledge processor according to one embodiment of the invention.

[0023] FIG. 7 is a block diagram of a management framework interface according to one embodiment of the invention.

[0024] FIG. 8 is a block diagram of a report module according to one embodiment of the invention.

[0025] FIG. 9A is a diagram illustrating a knowledge base according to one embodiment of the invention.

[0026] FIG. 9B is an architecture diagram for a rule pack according to one embodiment of the invention.

[0027] FIG. 10 illustrates a relationship between facts, rules and actions.

[0028] FIG. 11 illustrates an object diagram for a representative knowledge representation.

[0029] FIG. 12 is a block diagram of the managed node according to one embodiment of the invention.

[0030] FIG. 13 is a block diagram of an agent according to one embodiment of the invention.

[0031] FIG. 14 is a block diagram of a master agent according to one embodiment of the invention.

[0032] FIG. 15 is a block diagram of a sub-agent according to one embodiment of the invention.

[0033] FIGS. 16A and 16B are flow diagrams of manager startup processing according to one embodiment of the invention.

[0034] FIGS. 16C-16E are flow diagrams of manager startup processing according to another embodiment of the invention.

[0035] FIG. 17A is a flow diagram of master agent startup processing according to one embodiment of the invention.

[0036] FIG. 17B is a flow diagram of sub-agent startup processing according to one embodiment of the invention.

[0037] FIGS. 18A and 18B are flow diagrams of trigger/notification processing according to one embodiment of the invention.

[0038] FIG. 19 is a flow diagram of GUI report processing according to one embodiment of the invention.

[0039] FIGS. 20-29 are screen shots of a representative Graphical User Interface (GUI) suitable for use with one embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

[0040] The invention pertains to a management system for a computer system (e.g., an enterprise computer system). The computer system operates or includes various products (e.g., software products) that can be managed in a management system or collectively by a group of management systems. Typically, the management system operates on a computer separate from the computer system being managed. The management system can make use of a knowledge base of causing symptoms for previously observed problems at other sites or computer systems. In other words, the knowledge base can be built from and shared by different users across different products to leverage knowledge that is otherwise disparate. The knowledge base typically grows over time. The management system can use its ability to request information from the computer system being managed together with the knowledge base to infer a problem root cause in the computer system being managed. The computer system being managed can also request the management system to process its knowledge base for possible problem cause analysis. The management system can also continually identify persisting problem-causing symptoms.

[0041] Embodiments of the invention are discussed below with reference to FIGS. 1-29. However, those skilled in the art will readily appreciate that the detailed description given herein with respect to these figures is for explanatory purposes as the invention extends beyond these limited embodiments.

[0042] FIG. 1 is a block diagram of a management system 100 according to one embodiment of the invention. The management system 100 serves to manage a plurality of managed nodes 102-1, 102-2, . . . , 102-n. Each of the managed nodes 102-1, 102-2, . . . , 102-n respectively includes an agent 104-1, 104-2, . . . , 104-n. These agents 104 serve to monitor and manage products at the managed nodes 102. In one implementation, the agents 104 are stand-alone processes operating in their own process space. In another implementation, the agents 104 are specific to particular products being managed and reside at least partially within the process space of the products being managed. The agents 104 can monitor and collect data pertaining to the products. Since the products can utilize an operating system or network coupled to the managed nodes, the agents 104 are also able to collect state information pertaining to the operating system or the network. In still another implementation, the agents 104 are an embodiment of Simple Network Management Protocol (SNMP) agents available from third-parties or system vendors.

[0043] The agents 104 can be controlled to monitor specific information (e.g., resources) with respect to user-configurable specifics (e.g., attributes). The information (e.g., resources) being monitored can have zero or more layers or depths of specifics (e.g., attributes). The monitoring of the information can be performed dynamically on demand or periodically. The information being monitored can be focused or limited to certain details as determined by the user-configurable specifics (e.g., attributes). For example, the information being monitored can be focused or limited by certain levels/depths.

[0044] Optionally, the agents 104 can also be capable of performing certain statistical analysis on the data collected at the managed nodes. For example, the statistical analysis on the data might pertain to running average, standard deviation, or historical maximum and minimum.
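By way of illustration only, the statistical analysis mentioned above could be computed incrementally at an agent, so that samples need not be retained. The following is a minimal sketch in Python; the class name RunningStats and the sample values are hypothetical and are not part of the described embodiments, and Welford's online method is merely one suitable choice.

```python
import math


class RunningStats:
    """Hypothetical helper: tracks mean, standard deviation, min and max incrementally."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations (Welford's method)
        self.minimum = math.inf
        self.maximum = -math.inf

    def add(self, value):
        """Fold one collected sample into the statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def std_dev(self):
        """Population standard deviation; 0.0 when no samples have been added."""
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


# Assumed sample data: three heap-size readings in megabytes.
stats = RunningStats()
for heap_mb in (100.0, 120.0, 140.0):
    stats.add(heap_mb)
```

An agent using such a helper would report only the summary values to the manager, which keeps the volume of transferred data independent of the number of samples.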

[0045] The management system 100 also includes a management framework 106. The management framework 106 facilitates communications between the agents 104 for the managed nodes 102 and the manager 108. For example, different agents 104 can utilize different protocols (namely, management protocols) to exchange information with the management framework 106.

[0046] The management system 100 also includes a manager 108. The manager 108 serves to manage the management system 100. Consequently, the manager 108 can provide cross-products, cross-systems and multi-systems management in a centralized manner, such as for an enterprise network environment having multiple products or applications which serve different types of requests. In an enterprise network environment, the manager 108 has the ability to manage the various systems therein and their products and/or applications through a single entity. Geographically, these systems and products and/or applications can be centrally located or distributed locally or remotely (even globally).

[0047] FIG. 2 is a block diagram of a manager 200 for a management system according to one embodiment of the invention. For example, the manager 200 illustrated in FIG. 2 can pertain to the manager 108 illustrated in FIG. 1.

[0048] The manager 200 includes a Graphical User Interface (GUI) 202 that allows a user (e.g., an administrator) to interact with the manager 200 to provide user input. The user input can pertain to rules, resources or situations. In addition, the user input with the GUI 202 can pertain to administrative or configuration functions for the manager 200 or output information (e.g., reports, notifications, etc.) from the manager 200. The input data is supplied from the GUI 202 to a knowledge manager 204. The knowledge manager 204 confirms the validity of the rules, resources or situations and then converts such rules, resources or situations into a format being utilized for storage in a knowledge base 206. In one implementation, the format pertains to meta-data represented as JAVA properties. The knowledge base 206 stores the rules, resources and situations within the database in a compiled code format.

[0049] The manager 200 also includes a knowledge processor 208. The knowledge processor 208 interacts with the knowledge manager 204 to process appropriate rules within the knowledge base 206 in view of any relevant situations or resources. In processing the rules, the knowledge processor 208 often requests data from the agents 104 at the managed nodes. Such requests for data are initiated by the knowledge processor 208 and performed by way of a data acquisition unit 210 and a management framework interface 212. The returned data from the agents 104 is returned to the knowledge processor 208 via the data acquisition unit 210 and the management framework interface 212. With such monitored data in hand, the knowledge processor 208 can evaluate the relevant rules. When the rules (evaluated by the knowledge processor 208 in accordance with the monitored data received from the agents 104) indicate that a problem exists, then a variety of different actions can be performed. A corrective action module 213 can be initiated to take corrective action with respect to resources at the particular one or more managed nodes that have been identified as having a problem. Further, if debugging is desired, a debug module 214 can also be activated to interact with the particular managed nodes to capture system data that can be utilized in debugging the particular system problems.

[0050] The knowledge processor 208 can periodically, or on a scheduled basis, perform certain of the rules stored within the knowledge base 206. A notification module 216 can also initiate the execution of certain rules when the notification module 216 receives an indication from one of the agents 104 via the management framework interface 212. Typically, the agents 104 would communicate with the notification module 216 using a notification that would specify a management condition that the agent 104 has sent to the manager 200 via the management framework 106.

[0051] In addition, the manager 200 also includes a report module 218 that can take the data acquired from the agents 104 as well as the results of the processed rules (including debug data as appropriate) and generate a report for use by the user or administrator. Typically, the report module 218 and its generated reports can be accessed by the user or administrator through the GUI 202. The manager 200 also includes a log module 220 that can be used to store a log of system conditions. The log of system conditions can be used by the report module 218 to generate reports.

[0052] The manager 200 can also include a security module 222, a registry 224 and a registry data store 226. The security module 222 performs user authentication and authorization. Also, to the extent encoding is used, the security module 222 also performs encoding or decoding (e.g., encryption or decryption) of information. The registry 224 and the registry data store 226 serve and store structured information, respectively. In one implementation, the registry data store 226 serves as the physical storage of certain resource information, configuration information and compiled knowledge information from the knowledge base 206.

[0053] Still further, the manager 200 can include a notification system 228. The notification system 228 can use any of a variety of different notification techniques to notify the user or administrator that certain system conditions exist. For example, the notification techniques can include electronic mail, a pager message, a voice message or a facsimile. Once notified, the notified user or administrator can gain access to a report generated by the report module 218.

[0054] The debug module 214 is able to be advantageously initiated when certain conditions exist within the system. Such debugging can be referred to as “just-in-time” debugging. This focuses the capture of data for debug purposes to a constrained time period in specific areas of interest such that more relevant data is able to be captured.

[0055] FIG. 3 is a block diagram of a GUI 300 according to one embodiment of the invention. The GUI 300 is, for example, suitable for use as the GUI 202 illustrated in FIG. 2.

[0056] The GUI 300 includes a knowledge input GUI 302, a report output GUI 304, and an administrator GUI 306. The knowledge input GUI 302 provides a graphical user interface that facilitates interaction between a user (e.g., administrator) and a manager (e.g., the manager 200). Hence, using the knowledge input GUI 302, the user or administrator can enter rules, resources or situations to be utilized by the manager. The report output GUI 304 is a graphical user interface that allows the user to access reports that have been generated by a report module (e.g., the report module 218). Typically, the report output GUI 304 would not only allow initial access to such reports, but would also provide a means for the user to acquire additional detailed information about reported conditions. For example, the report output GUI 304 could enable a user to view a report on chosen criteria such as case ID or a period of time. The administrator GUI 306 can allow the user to configure or utilize the manager. For example, the administrator GUI 306 can allow creation of new or modification to existing users and their access passwords, specific information about managed nodes and agents (including managed-node IP and port, agent name, agent types), electronic mail server and user configuration.

[0057] FIG. 4 is a block diagram of a knowledge manager 400 according to one embodiment of the invention. The knowledge manager 400 is, for example, suitable for use as the knowledge manager 204 illustrated in FIG. 2.

[0058] The knowledge manager 400 includes a knowledge code generator 402. In particular, the knowledge code generator 402 receives rules or definitions (namely, definitions for resources or situations) and then generates and outputs knowledge code to a knowledge processor, such as the knowledge processor 208. In one implementation, the knowledge code generator 402 can be considered a compiler, in that the rules or definitions are converted into a data representation suitable for execution. The knowledge code can be a program code or it can be a meta-language. In one implementation, the knowledge code is executable by an inference engine such as JESS. Additional information on JESS is available at “http://herzberg.ca.sandia.gov/jess” as an example.

[0059] The knowledge manager 400 also includes a knowledge encoder/decoder 404, a knowledge importer/exporter 406 and a knowledge update manager 408. The knowledge encoder/decoder 404 can perform encoding when storing knowledge to the knowledge base 206 or decoding when retrieving knowledge from the knowledge base 206. The knowledge importer/exporter 406 can import knowledge from another knowledge base and can export knowledge to another knowledge base. In general, the knowledge update manager 408 serves to manually or automatically update the knowledge base 206 with additional sources of knowledge that are available and suitable. In one embodiment, the knowledge update manager 408 operates to manage the general coherency of the knowledge base 206 with respect to a central knowledge base. Typically, the knowledge base 206 stored and utilized by the knowledge manager 400 is only a relevant portion of the central knowledge base for the environment in which the knowledge manager 400 operates.

[0060] FIG. 5A is a diagram of a directed graph 500 representing a knowledge base. The knowledge base represented by the directed graph 500 is, for example, suitable for use as the knowledge base 206 illustrated in FIG. 2. The directed graph 500 represents a pictorial view of the knowledge code resulting from rules, situations and resources.

[0061] The directed graph 500 is typically structured to include base resources at the top of the directed graph 500, situations/resources in a middle region of the directed graph 500, and actions (action resources) at the bottom (or leaf nodes) of the directed graph 500. In particular, node 502 pertains to a base resource or resources and node 504 pertains to a situation and/or resource. A relationship 506 between the nodes 502 and 504 is determined by the rule being represented by the directional arrow between the nodes 502 and 504. The situation/resource at node 504 in turn relates to another situation/resource at node 508. A relationship 510 relates the nodes 504 and 508, namely, the relationship 510 is determined by the rule represented by the directional arrow between the nodes 504 and 508. The situations/resources at nodes 504 and 508 together with the relationship 510 pertain to another rule. The situation/resource at node 508 is further related to an action resource at node 512. A relationship 514 between the situation/resource at node 508 and the action resource at node 512 is determined by still another rule, namely, an action rule.
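As a non-limiting illustration, the chain of nodes and relationships described for the directed graph 500 could be held in an adjacency list, with each edge labeled by the relationship (rule) that it represents. The sketch below is in Python; the dictionary layout and the function name paths_to_actions are assumptions made for illustration, and the graph is assumed to be acyclic.

```python
# Each key is a node; each value lists (relationship, target node) pairs.
# The reference numerals are taken from the description of FIG. 5A; the
# data layout itself is hypothetical.
GRAPH = {
    "502": [("506", "504")],  # base resource -> situation/resource
    "504": [("510", "508")],  # situation/resource -> situation/resource
    "508": [("514", "512")],  # situation/resource -> action resource
    "512": [],                # leaf node: action resource
}


def paths_to_actions(graph, start):
    """Return every path from start down to a leaf (action) node."""
    if not graph[start]:
        return [[start]]
    paths = []
    for _relationship, target in graph[start]:
        for tail in paths_to_actions(graph, target):
            paths.append([start] + tail)
    return paths


chain = paths_to_actions(GRAPH, "502")
```

In this representation, growing the knowledge base amounts to adding keys and edges, which leaves existing entries untouched.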

[0062] The knowledge base represented by the directed graph 500 is flexible and extendible given the hierarchical architecture of the directed graph 500. Hence, the knowledge base is able to grow over time to add capabilities without negatively affecting previously existing knowledge within the knowledge base. The knowledge base is also able to be divided or partitioned for different users, applications or service plans. In effect, as the knowledge base grows, the directed graph 500 representation grows to add more nodes, such nodes representing situations or resources as well as relationships (i.e., rules) between nodes.

[0063] FIG. 5B represents a small portion of knowledge provided in a segment 520 of a directed graph (e.g., directed graph 500). The segment 520 includes nodes 522, 526, 530 and 534, and relationships 524, 528 and 532. The node 522 pertains to a resource, namely, heap size of Java Virtual Machine (JVM) in use. The relationship 524 indicates that when the node 522 is triggered, the node 526 is triggered. The node 526 pertains to a resource, namely, maximum heap size of JVM. The relationship 528 evaluates whether the maximum heap size for JVM is less than 1/0.8 times the heap size for JVM in use (i.e., whether the heap in use exceeds 80 percent of the maximum heap size). When the relationship 528 is true, then the node 530 is triggered to acquire a resource, namely, TopHeapObjects for JVM, which is a debugging resource that obtains information about the objects that are consuming the largest amount of JVM heap. The specifics of this resource, which are described by the attributes of the resource, include the resource consumption selected by cumulative size or the number of objects, the count of the distinct objects, and the selection of objects by the JAVA classes they belong to. The relationship 532 then always causes the node 534 to invoke a resource action, namely, initiating an allocation trace for JVM. The specifics of this resource, selectable by its attributes, can include but are not limited to the classes of objects to trace, the time period for tracing, and the depth of stack to which to limit every trace.
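For explanatory purposes, the evaluation performed along the segment 520 can be sketched as a single function. The Python below is a minimal sketch under stated assumptions: the function name, the step labels, and the reading of the 1/0.8 factor as a comparison of the maximum heap size against the heap size in use divided by 0.8 are illustrative and are not a definitive implementation.

```python
HEAP_FRACTION = 0.8  # assumed reading of the 1/0.8 factor of relationship 528


def evaluate_heap_segment(heap_in_use, max_heap):
    """Return the follow-up steps triggered for segment 520 (hypothetical labels)."""
    steps = []
    # Relationship 528: maximum heap size < (1 / 0.8) x heap size in use.
    if max_heap < heap_in_use / HEAP_FRACTION:
        steps.append("acquire TopHeapObjects")   # node 530
        steps.append("start allocation trace")   # node 534; relationship 532 always fires
    return steps


# Assumed sample values in megabytes.
busy = evaluate_heap_segment(heap_in_use=900, max_heap=1000)
idle = evaluate_heap_segment(heap_in_use=500, max_heap=1000)
```

With the sample values, the first call exceeds the threshold and yields both follow-up steps, while the second yields none.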

[0064] FIG. 5C represents a small portion of knowledge provided in a segment 540 of a directed graph (e.g., directed graph 500). The segment 540 includes nodes 542, 546, 550, 554 and 558, and relationships 544, 548, 552 and 556. The node 542 pertains to a situation, namely, a JVM exception. The relationship 544 causes the node 546 to invoke a filter operation when the situation at node 542 is present. The filter operation at node 546 is a search expression that searches the JVM exception resource information received from agent 104 for an attribute “ORA-00018” which represents a particular problem with an Oracle database, namely, the Oracle database running out of database connections for the managed JAVA application to use. When the search expression is found, the relationship 548 causes the node 550 to trigger. At node 550, a resource for maximum users configured for the Oracle database being used by the managed JAVA application is obtained. Then, the relationship 552 determines whether the maximum users for the Oracle product is less than fifty (50) and, if so, the node 554 invokes an action, namely, an email notification is sent. In addition, the relationship 556 always triggers the node 558 to acquire a resource pertaining to the number of connected users for the relevant Oracle database. The segment 540 thus contains two distinct rules: a first rule represented by the resources 542, 546, 550 and 558 and the relationships 544, 548 and 556, and a second rule represented by the resources 550 and 554 and the relationship 552. The two rules can be defined using the GUI 202 at different times and possibly by different users, and neither rule needs to be known when the other is defined. The knowledge base automatically links or chains these rules through the commonality of the resources (e.g., the Oracle maximum configured users resource 550 in this example).
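The automatic chaining of independently defined rules through a shared resource can be illustrated with a small sketch. In the Python below, rules are indexed by the resource that triggers them, so any rule added later for the same resource is picked up without reference to earlier rules. The class KnowledgeBase, its methods, the resource names and the sample values are all hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict


class KnowledgeBase:
    """Hypothetical store that indexes rules by their triggering resource."""

    def __init__(self):
        self._rules_by_resource = defaultdict(list)

    def add_rule(self, trigger, condition, result):
        """Register one edge: when trigger is present and condition holds, reach result."""
        self._rules_by_resource[trigger].append((condition, result))

    def propagate(self, resource, value, values_for):
        """Fire rules breadth-first from resource; return the names reached, in order."""
        reached = []
        queue = [(resource, value)]
        while queue:
            name, current = queue.pop(0)
            for condition, result in self._rules_by_resource[name]:
                if condition(current):
                    reached.append(result)
                    queue.append((result, values_for.get(result)))
        return reached


kb = KnowledgeBase()
# First rule (nodes 542, 546, 550, 558), defined by one user.
kb.add_rule("jvm_exception", lambda text: "ORA-00018" in text, "oracle_max_users")
kb.add_rule("oracle_max_users", lambda _value: True, "oracle_connected_users")
# Second rule (nodes 550, 554), defined separately; it shares the resource at node 550.
kb.add_rule("oracle_max_users", lambda count: count < 50, "send_email")

# Assumed sample data standing in for values returned by an agent.
sample_values = {"oracle_max_users": 40, "oracle_connected_users": 40}
reached = kb.propagate("jvm_exception", "java.sql.SQLException: ORA-00018", sample_values)
```

Because both rules hang off the same resource key, the second rule fires as soon as the first rule obtains the maximum configured users, even though neither rule refers to the other.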

[0065] FIG. 6 is a block diagram of a knowledge processor 600 according to one embodiment of the invention. The knowledge processor 600 is, for example, suitable for use as the knowledge processor 208 for the manager 200 illustrated in FIG. 2.

[0066] The knowledge processor 600 includes a controller 602 that couples to a knowledge manager (e.g., the knowledge manager 204). The controller 602 receives the knowledge code from the knowledge manager and directs it to an inference engine 604 to process the knowledge code. In one embodiment, the knowledge code is provided in an inference language such that the inference engine 604 is able to execute the knowledge code.

[0067] In executing the knowledge code, the inference engine 604 will typically inform the controller 602 of the particular data to be retrieved from the managed nodes via the agents and the management framework interface. In this regard, the controller 602 will request the data via a management interface 606 to a management framework. The returned data from the managed nodes is then returned to the controller 602 via the management interface 606. Alternatively, in executing the knowledge code, exceptions (i.e., unexpected events) can be generated at the managed nodes and pushed through the management interface 606 to the controller 602. In either case, the controller 602 then forwards the returned data to the inference engine 604. At this point, the inference engine 604 can continue to process the knowledge code (e.g., rules). The inference engine 604 may utilize a rule evaluator 608 to assist with evaluating the relationships or rules defined by the knowledge code. The rule evaluator 608 can perform not only the relationship checking for rules but also data parsing. Once the knowledge code has been executed, the inference engine 604 can inform the controller 602 to have various operations performed. These operations can include capturing of additional data from the managed nodes, initiating debug operations, initiating corrective actions, initiating logging of information, or sending of notifications.
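The exchange between the controller 602 and the inference engine 604 can be sketched, by way of example only, as one request/evaluate cycle. In the Python below, the classes InferenceEngine and Controller, the rule tuples, and the fetch callable standing in for the management interface 606 are all assumptions made for illustration; a real inference engine such as JESS would operate quite differently.

```python
class InferenceEngine:
    """Hypothetical engine: each rule is (resource name, predicate, operation name)."""

    def __init__(self, rules):
        self.rules = rules

    def needed_resources(self):
        """Tell the controller which data must be retrieved from the managed nodes."""
        return sorted({resource for resource, _predicate, _operation in self.rules})

    def evaluate(self, data):
        """Return the operations whose rule predicates hold for the returned data."""
        return [operation for resource, predicate, operation in self.rules
                if resource in data and predicate(data[resource])]


class Controller:
    """Hypothetical controller: fetches requested data and forwards it to the engine."""

    def __init__(self, engine, fetch):
        self.engine = engine
        self.fetch = fetch  # stand-in for a request through the management interface

    def run_once(self):
        data = {name: self.fetch(name) for name in self.engine.needed_resources()}
        return self.engine.evaluate(data)


# Assumed sample rules and agent data.
engine = InferenceEngine([
    ("cpu_load", lambda load: load > 0.9, "send notification"),
    ("free_disk_mb", lambda free: free < 100, "log condition"),
])
agent_data = {"cpu_load": 0.95, "free_disk_mb": 500}
operations = Controller(engine, agent_data.__getitem__).run_once()
```

The sketch covers only the pull case, in which the engine asks for data; pushed exceptions would enter the same evaluate step with data supplied by the managed node.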

[0068] The knowledge processor 600 also can include a scheduler 610. The scheduler 610 can be utilized by the inference engine 604 or the controller 602 to schedule a future action, such as the retrieval of data from the managed nodes.
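One simple way to realize such scheduling, offered only as a sketch, is a priority queue ordered by due time. The Python below uses the standard heapq module; the class name Scheduler, its methods, and the use of plain numbers for time are hypothetical choices made for illustration.

```python
import heapq


class Scheduler:
    """Hypothetical scheduler: keeps future actions ordered by their due time."""

    def __init__(self):
        self._queue = []
        self._sequence = 0  # tie-breaker so equal times keep insertion order

    def schedule(self, run_at, action):
        """Register an action to become due at time run_at."""
        heapq.heappush(self._queue, (run_at, self._sequence, action))
        self._sequence += 1

    def due(self, now):
        """Remove and return every action whose due time is at or before now."""
        ready = []
        while self._queue and self._queue[0][0] <= now:
            ready.append(heapq.heappop(self._queue)[2])
        return ready


# Assumed sample actions; times are arbitrary units.
scheduler = Scheduler()
scheduler.schedule(10, "retrieve heap size")
scheduler.schedule(5, "retrieve cpu load")
scheduler.schedule(20, "retrieve disk usage")
first_batch = scheduler.due(10)
```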

[0069] FIG. 7 is a block diagram of a management framework interface 700 according to one embodiment of the invention. The management framework interface 700 is, for example, suitable for use as the management framework interface 212 illustrated in FIG. 2.

[0070] The management framework interface 700 includes an SNMP adapter 702 and a standard management framework adapter 704. The SNMP adapter 702 allows the management framework interface 700 to communicate using the SNMP protocol. The standard management framework adapter 704 allows the management framework interface 700 to communicate with any other communication protocols that might be utilized by standard management frameworks, such as other product managers and the like. The management framework interface 700 also includes an enterprise manager 706, a domain group manager 708, and an available domain/resources module 710. During startup of the management framework interface 700 (which is typically associated with an enterprise), the enterprise manager 706 will identify all groups within the enterprise. Then, the domain group manager 708 will operate to identify all managed nodes within each of the groups. Thereafter, the available domain/resources module 710 will identify all domains and resources associated with each of the identified domains. Hence, the domains and resources for a given enterprise are able to be identified at startup so that the other components of a manager (e.g., the manager 200) are able to make use of the available domains and resources within the enterprise. For example, a GUI can have knowledge of such resources and domains for improved user interaction with the manager, and the knowledge processor can understand which rules within the knowledge base 206 are pertinent to the enterprise.
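The startup discovery sequence described above (groups, then nodes within each group, then domains and resources at each node) can be sketched as a nested traversal. The Python below is illustrative only; the dictionary layout of the enterprise, the function name discover, and the sample names are assumptions and do not reflect any particular management framework.

```python
def discover(enterprise):
    """Return {domain: sorted resource names} found across all groups and nodes."""
    available = {}
    for group in enterprise["groups"]:                         # enterprise manager 706
        for node in group["nodes"]:                            # domain group manager 708
            for domain, resources in node["domains"].items():  # module 710
                available.setdefault(domain, set()).update(resources)
    return {domain: sorted(names) for domain, names in available.items()}


# Assumed sample enterprise with two groups and three nodes.
ENTERPRISE = {
    "groups": [
        {"nodes": [
            {"domains": {"jvm": ["heap_size", "max_heap_size"]}},
            {"domains": {"jvm": ["heap_size"], "oracle": ["max_users"]}},
        ]},
        {"nodes": [
            {"domains": {"oracle": ["connected_users"]}},
        ]},
    ],
}

catalog = discover(ENTERPRISE)
```

A catalog of this kind is what would let a GUI offer only the resources that actually exist, and let a knowledge processor skip rules that refer to absent domains.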

[0071] The management framework interface 700 also includes an incomingnotification manager 712. The incoming notification manager 712 receivesnotifications from the agents within managed nodes. These notificationscan pertain to events that have been monitored by the agents, such as asystem crash or the presence of a new resource. More generally, thesenotifications can pertain to changes to monitored data at the managednodes by the agents.

[0072] The management framework interface 700 also includes a managed node administrator module 714. The managed node administrator module 714 allows a user or administrator to interact with the management framework interface 700 to alter nodes or domains within the enterprise, such as by adding new nodes or domains, updating domains, reloading domains, etc.

[0073] Still further, the management framework interface 700 can also include a managed node update module 716. The managed node update module 716 can discover managed nodes and thus permits a manager to recognize and receive status (e.g., active/inactive) of the managed nodes.

[0074] FIG. 8 is a block diagram of a report module 800 according to one embodiment of the invention. The report module 800 is, for example, suitable for use as the report module 218 illustrated in FIG. 2.

[0075] The report module 800 includes a presentation manager 802, a format converter 804 and a report view selector 806. The presentation manager 802 operates to process the raw report data provided by a log module (e.g., log module 220) in order to present an easily understood, richly formatted report. Such a report might include associated graphical components that a user can interact with using a GUI (e.g., GUI 202). Examples of graphical components for use with such reports are buttons, pull-down lists, etc. The format converter 804 can convert the raw report data into a format suitable for printing and display. The report view selector 806 allows viewing of partial or complete log data/raw report data in different ways as selected using a GUI. These views can, for example, include one or more of the following types of reports: (1) Report Managed nodes wise: show report for the selected managed node/process identifier only; (2) Report time wise: show report for the last xyz hours (time desired by the user), with the user having the option of choosing the managed node to view; (3) Report Rule wise: show report for the selected rule that might be applicable for a number of JVM instances; (4) Report Rule pack wise: show report for all the rules fired under a particular rule pack; (5) Report Last Fired Rules wise: show report for rules fired after the last re-start of the inference engine; (6) Report Rule Fired Frequency wise: show report for rules fired as per the selected fired frequency (e.g., useful to get the recurrence pattern of event occurrence); (7) Report Domain wise: show report pertaining to a particular domain, e.g., JVM (if a rule is composed of multiple domains, this report can show the rules that include the selected domain); (8) Report Resource wise: show report for all rules including a particular resource under the domain (e.g., jvm_Exception); (9) Report filter wise: show report pertaining to rules having similar filter conditions; (10) Report Day wise: show report for all events that happened in a day; (11) Report Refreshed Values wise: show the next refreshed state of the same report and highlight changed/added records; (12) Report Case ID wise: show the report based on problem case identifier (id); and (13) Customized Structure reports: allow the user to select a combination of the above or provide a report filter of their own.
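The report views listed above can be thought of as filters applied to the same raw log data. The following is a minimal sketch of that idea, written in Python for brevity rather than in the JAVA of the described embodiment. The record layout (`LogRecord`) and the helper names `select_view` and `last_hours` are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    """One fired-rule entry in the raw report data (hypothetical layout)."""
    node: str
    rule: str
    rule_pack: str
    domain: str
    hour: int  # hours elapsed since logging began


def select_view(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [r for r in records
            if all(getattr(r, key) == value for key, value in criteria.items())]


def last_hours(records, now, hours):
    """Time-wise view: records logged within the last `hours` hours."""
    return [r for r in records if now - hours <= r.hour <= now]


LOG = [
    LogRecord("node-a", "heap-high", "default", "jvm", 1),
    LogRecord("node-b", "heap-high", "default", "jvm", 5),
    LogRecord("node-a", "conn-low", "oracle-pack", "oracle", 6),
]

by_node = select_view(LOG, node="node-a")                  # managed-node view
by_pack = select_view(LOG, rule_pack="default")            # rule-pack view
custom = select_view(LOG, node="node-a", domain="oracle")  # combined view
recent = last_hours(LOG, now=6, hours=2)                   # time-wise view
```

Passing several criteria at once corresponds to the customized structure report, in which a user combines two or more of the basic views.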

[0076] FIG. 9A is a diagram illustrating a knowledge base 900 according to one embodiment of the invention. The knowledge base 900 is, for example, suitable for use as the knowledge base 206 illustrated in FIG. 2 or the knowledge base 500 illustrated in FIG. 5A. The architecture for the knowledge base 900 renders the knowledge base 900 well-suited to be managed, deployed and scaled. The knowledge base 900 typically resides within a manager, such as the manager 200 illustrated in FIG. 2. However, the knowledge base 900 can also be distributed between a manager and managed nodes, such that the processing load can be likewise distributed.

[0077] The knowledge base 900 includes one or more knowledge domains and one or more rule packs. In particular, the knowledge base 900 illustrated in FIG. 9A includes knowledge domain A 902, knowledge domain B 904 and knowledge domain C 906. Through use of the rule packs, these multiple knowledge domains 902, 904, and 906 can be linked together so that they effectively cooperate with one another concurrently. A particular knowledge domain is a software representation of know-how pertaining to a specific field (or domain). The knowledge domains can be physical domains and/or virtual domains. A physical domain often pertains to a particular managed product. A virtual domain can pertain to a set of resources defined by a user to achieve effective manageability.

[0078] The knowledge base 900 also includes rule packs 910 and 912. These rule packs (or knowledge rule packs) are collections of rules (i.e., relationships between different kinds of resources/situations). The purpose of the rule packs is to collect the rules such that management, modification and tracking of knowledge is made easier. By separating knowledge into domains and rule packs, each knowledge component can be individually tested as well as tested together with other knowledge components. In other words, each domain or rule pack is a logically separate piece of knowledge which can be installed and uninstalled as desired.

[0079] FIG. 9B is an architecture diagram for a rule pack 914 according to one embodiment of the invention. The rule pack 914 includes rules 916, facts 918 and functions 920. The rule pack 914 depends on a set of facts 918 for its reasoning, a set of facts that it generates, a set of functions 920 that it calls upon, and a set of rules 916 that act to read and write facts and perform the functions.

[0080] When a rule pack is installed, the system must keep track of its rules, functions, inputs and outputs so that a large installed base of rule packs can be managed. Hence, an individual rule pack can be added to or removed from the knowledge base without adversely affecting the entire system.

[0081] Further, two rule packs may operate on the same set of shared facts. The two knowledge rule packs may also generate a set of shared facts. These rule packs can facilitate the tracking of how a fact travels through various rule packs, and how a fact may be generated by multiple rule packs. The functions and rules of rule packs can also be more precisely monitored by using the smaller sized rule packs. It is also possible for one rule to exist in two or more rule packs. Hence, when such two or more rule packs that share a rule are merged into a knowledge base, only one copy of the rule need exist within the knowledge base.
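The bookkeeping described above, in which rule packs are installed and removed independently and a rule shared by several packs is stored only once, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the described implementation: the class name `KnowledgeBase`, its methods, and the policy of keeping a shared rule until the last pack that owns it is uninstalled are all hypothetical.

```python
class KnowledgeBase:
    """Tracks which installed rule packs contributed each rule (hypothetical)."""

    def __init__(self):
        self._owners = {}  # rule name -> set of rule pack names

    def install(self, pack_name, rule_names):
        """Record each rule once, remembering every pack that supplies it."""
        for rule in rule_names:
            self._owners.setdefault(rule, set()).add(pack_name)

    def uninstall(self, pack_name):
        """Remove a pack; drop a rule only when no other pack still owns it."""
        for rule in list(self._owners):
            self._owners[rule].discard(pack_name)
            if not self._owners[rule]:
                del self._owners[rule]

    def rules(self):
        """Return the single merged copy of all rule names, sorted."""
        return sorted(self._owners)


kb = KnowledgeBase()
kb.install("jvm-pack", ["heap-high", "thread-leak"])
kb.install("appserver-pack", ["heap-high", "pool-exhausted"])
merged = kb.rules()      # "heap-high" appears once although two packs share it
kb.uninstall("jvm-pack")
remaining = kb.rules()   # "heap-high" survives because another pack owns it
```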

[0082] An expert system object manages the knowledge base. For example, the expert system object can reset an inference engine, load and unload rule packs or domains, insert or retract runtime facts, etc.

[0083] The knowledge representation utilized by the present invention makes use of three major components: facts, rules and actions. Collectively, these components are utilized to perform the tasks of monitoring and managing a computer resource, such as a JVM, an operating system, a network, a database or applications.

[0084] FIG. 10 illustrates a relationship 1000 between facts 1002, rules 1004 and actions 1006. According to the relationship 1000, facts 1002 trigger rules 1004. The rules 1004 that are triggered cause the actions 1006. The actions 1006 then may cause additional facts to be added to the repository of the facts 1002. A fact can be considered a record of information. One example of a fact is the number of threads running in a JVM. Another example of a fact is an average load on a CPU. Rules are presented as “if-then” statements. In one embodiment, the left-hand side of the “if-then” statement can have one or more patterns, and the right-hand side of the “if-then” rule can contain a procedural list of one or more actions. The patterns are used as conditions to search for a fact in the repository of the facts 1002, and thus locate a rule that can be used to infer something. The actions are functions that perform a task. As an example, the actions can be considered to be statements that would otherwise be used in the body of a programming language (e.g., JAVA or C programs). As another example, the actions can be used to obtain debug information using a resource.
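The cycle in which facts trigger rules, rules cause actions, and actions add further facts can be illustrated with a very small forward-chaining loop. The sketch below is in Python and is only a simplified stand-in for a real rules engine such as JESS; the function `run`, the tuple layout of facts and rules, and the example rule names are assumptions made for illustration.

```python
def run(facts, rules):
    """Fire each rule once per matching fact until no new facts appear.

    facts: set of (name, value) tuples; it is extended in place.
    rules: list of (rule_name, pattern, action), where pattern(fact) returns
           a bool (left-hand side) and action(fact) returns an iterable of
           new facts (right-hand side).
    Returns the names of the fired rules in firing order.
    """
    fired, seen = [], set()
    agenda = list(facts)
    while agenda:
        fact = agenda.pop(0)
        for name, pattern, action in rules:
            if (name, fact) in seen or not pattern(fact):
                continue
            seen.add((name, fact))
            fired.append(name)
            for new_fact in action(fact):
                if new_fact not in facts:
                    facts.add(new_fact)      # the action adds a fact ...
                    agenda.append(new_fact)  # ... which may trigger more rules


    return fired


rules = [
    ("cpu-overload",
     lambda f: f[0] == "cpu_load" and f[1] > 0.9,
     lambda f: [("alert", "cpu")]),
    ("notify",
     lambda f: f[0] == "alert",
     lambda f: []),
]
facts = {("thread_count", 42), ("cpu_load", 0.95)}
fired = run(facts, rules)
```

Here the `cpu_load` fact triggers the first rule, whose action adds an `alert` fact, which in turn triggers the second rule; the `thread_count` fact matches no pattern and fires nothing.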

[0085] The rules 1004 can be represented in the JAVA Expert System Shell (JESS), which also serves as the rule engine that drives these rules. JESS offers a CLIPS-like language for specifying inference rules, facts and functions. The relationship 1000 thus facilitates the creation of a data-driven knowledge base that is well-suited for monitoring and managing computer resources.

[0086] FIG. 11 illustrates an object diagram 1050 for a representative knowledge representation. The object diagram 1050 includes a rule pack 1 inference object 1052 and a rule pack 2 inference object 1054. An inference object for a rule pack encompasses the rules written for the associated knowledge domain(s), and a rules engine can then read and execute these rules. A JESS package can be utilized to provide this functionality. Surrounding each of the inference objects 1052 and 1054 are domain facts and domain actions. Although the arrangement of the rule packs shown in FIG. 11 is such that the rule packs pertain to a particular domain, rule packs can also be arranged to pertain to multiple domains.

[0087] The relationship between a domain fact and an inference object is always an arrow pointing from the fact to the inference object, thereby denoting that facts are “driving” the rules inside the inference engine. The relationship between the inference object and the actions is that of an arrow pointing from the inference object toward the action, meaning the inference rules “drive” the actions. Between the two inference objects 1052 and 1054 are facts and actions that both inference objects 1052 and 1054 utilize. In effect, these inference objects 1052 and 1054 are cooperative expert systems, namely, expert systems that cooperate in a group by sharing some of their knowledge with one another.

[0088] Facts can be used to represent the “state” of an expert system in small chunks. For example, a fact may appear as “MAIN::jvm-jvm_heapused (v "3166032") (uid "372244480") (instance "13219") (host unknown)”. The content of the fact indicates that in the current Java Virtual Machine (JVM) on system “unknown” with instance or process id 13219, the size of heap used is 3166032 bytes. In this example, uid, instance and host are some of the attributes of the resource jvm_heapused belonging to the domain jvm. The attributes of a resource that are not used for comparison with other resources need not be included in the facts for the resource. Facts, as implemented by JESS, exist inside the rules engine. To add an additional fact into the rules engine, the new fact is injected into the inference engine object. The repository of facts can be represented hierarchically. The knowledge base can, for example, be sorted and transmitted as needed as a set of XML documents or provided as shared distributed databases using LDAP or as JAVA Properties files.
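To make the structure of such a fact concrete, the sketch below splits the example fact into its module, domain, resource and attribute slots. It is an illustrative Python sketch only; the function `parse_fact` is hypothetical, and the fact text is written with a plain hyphen between domain and resource and with straight quotes, which is an assumption about the intended notation.

```python
import re

# The example fact from the text, with a hyphen separating domain and resource.
FACT = ('MAIN::jvm-jvm_heapused (v "3166032") (uid "372244480") '
        '(instance "13219") (host unknown)')


def parse_fact(text):
    """Split a fact string into module, domain, resource and slot values."""
    head, _, rest = text.partition(" ")
    module, _, name = head.partition("::")
    domain, _, resource = name.partition("-")
    # Each slot looks like (key "value") or (key value).
    pairs = re.findall(r'\((\w+) ("[^"]*"|[^\s)]+)\)', rest)
    slots = {key: value.strip('"') for key, value in pairs}
    return {"module": module, "domain": domain,
            "resource": resource, "slots": slots}


parsed = parse_fact(FACT)
heap_bytes = int(parsed["slots"]["v"])
```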

[0089] In the case of a cooperative expert system, access to a shared set of facts is needed. The facts can be logically organized into separate domains. In one implementation, a user may choose to organize shared knowledge into separate knowledge rule packs, or alternatively, allow the same fact definition to exist within multiple rule packs. In the latter approach, the system can manage the consistency of the facts using a verification process at the managed resource node (in the form of capability requests) and at the knowledge control module (in the form of definition verification).

[0090] The rules are used to map facts into actions. Rules are preferably domain-specific, such that separate domains of knowledge are provided as modular and independent rule sets. Hence, the modification of one domain of rules and its internal facts would not affect other domains. The different rule packs interact with each other only through shared facts.

[0091] An example of a rule implemented using JESS is as follows:

[0092] (defrule default-jvm-memory-leak-detect (jvm-jvm_heapused (v ?r1) (uid ?uid) (instance ?instance) (host ?host)) (test (> ?r1 1000000)) => (...some actions...))

[0093] The “default-” prefix denotes the rule pack the rule belongs to. Since it is possible that a memory leak can exist for an application or an application server, utilizing separate name spaces for each rule pack allows separation of these rules into different rule packs. Another advantage of using a separate name space for different rule packs is that JESS rules are serializable, meaning that text rules can be encoded into binary form. The ability to store rules in binary form serves to protect the intellectual property encoded within the rules.

[0094] Actions are procedural statements to be executed. The actions may reside on the right-hand side of rules in the form of scripts or can be embedded as methods inside programming objects (e.g., JAVA objects). In the case of scripts, the scripts are inference engine-dependent such that different inference engines would utilize different scripts because of the different languages utilized by the inference engines. In the case of programming objects, the actions are functions. For example, actions in JAVA can be implemented by registering them as new JESS functions. Alternatively, the functions could be packaged inside fact objects for which such rules are relevant. The functions could in turn request relevant resource values from the managed nodes and assert the values obtained as facts into the inference engine. The fact objects (e.g., get values) represent values obtained from agents (e.g., using a scheduler of an agent).

[0095] Given that actions can be complicated and not tied to any particular facts, it is often more efficient to create a global object for a domain and include the methods or functions for actions therein such that every rule within a rule pack has access to the actions.

[0096] Through the use of a modular design, the system becomes easier to manage even when thousands of rules and facts exist. By separating rules into rule packs and facts into domains, and making it difficult for domains to interfere with one another, the expert system is effectively divided into smaller modular pieces. Additionally, through use of JESS's built-in watch facility, the system can track those rules that have fired and the order in which they have fired. This watch facility thus provides a limited tool for debugging a knowledge system. Groups of rules can be isolated for inspection by turning off other rules. Rules can be turned off by deactivating the inference objects that are not desired to fire. If one were to desire to debug a set of rules related to one domain, such a set of rules could be manually grouped into a logical group (e.g., rule pack), and a user of the management system can use the GUI 202 to control the activation of each group. Using the GUI 202, the user can additionally control activation of a single rule or a selected set of rules within a rule pack.

[0097] Initialization scripts can be used to set up all the components needed for a rule pack. The setup can operate to create the inference object, load the rules, create initial facts, create action objects, and link all the objects together so that they can inter-operate.

[0098] In the JESS/JAVA implementation, one inference object may contain rules from one or more rule packs. Outside the inference object are objects that represent facts and objects that encapsulate actions. Each inference object is attached to a set of facts and actions. The rules engine searches the facts for matches that can trigger a rule to fire. Once a rule is fired, one or more action objects linked thereto are invoked. Actions can also be explicitly linked by using an initialization that involves JAVA object creation and passing handles to these objects to appropriate JESS inference objects.

[0099] One useful aspect of the rule engine design is the ability of the system to manage different combinations of multiple products on multiple nodes using one set of rule packs and one manager. This simplifies the distribution, configuration and manageability of rule packs on a per-user basis. For example, the rules engine can have rule packs for managed products JVM and Oracle loaded, but one managed node may not have Oracle as a managed product. In this case, naturally there will be no facts corresponding to Oracle resources for the managed node asserted into the inference engine and hence the rules using those Oracle resources will not be active for the managed node without Oracle as a managed product. Note that the information about the managed node is part of the fact representing any Oracle resource.

[0100] Another useful aspect of the rules engine design is the implicit chaining of rules by the inference engine. A user of the system defines individual rules representing problem or diagnostic “cases”. The system combines these individual rules based on the use of common facts representing resources. For example, one rule can be, represented in a meta-language, “IF (jvm-uncaught_exception AND filter-exception_is_Oracle_connections_exhausted) THEN (get Oracle-max_connections_configured)”. A second rule can be, represented in a meta-language, “IF (Oracle-max_connections_configured < 50) THEN (email dba)”. When the inference engine is running, if the jvm-uncaught_exception fact gets asserted into the inference engine and if the asserted fact contains the Oracle_connections_exhausted status, then the management system will obtain the Oracle-max_connections_configured resource from the same managed node as described by the host attribute of the exception resource. On request from the inference engine, the corresponding fact will be asserted into the inference engine. The inference engine will now automatically detect the second rule definition using the Oracle-max_connections_configured resource and the second rule will automatically get into action. It will check if the fact value representing the Oracle-max_connections_configured resource is less than 50 and, if so, it will automatically send electronic mail to the dba.
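The implicit chaining of the two example rules can be sketched as follows. This Python sketch hard-codes the two rules rather than using an inference engine, so it only illustrates how the shared Oracle fact links them; the function `diagnose`, the dictionary layout of facts, and the `fetch_resource` callable (standing in for a request to the agent on the managed node) are hypothetical.

```python
def diagnose(initial_fact, fetch_resource):
    """Chain the two example rules through a shared fact.

    fetch_resource(host, resource_name) is a hypothetical stand-in for
    requesting a resource value from the managed node identified by host.
    Returns the list of actions that were triggered.
    """
    facts = [initial_fact]
    actions = []
    for fact in facts:  # the list may grow while it is being iterated
        if (fact["resource"] == "jvm-uncaught_exception"
                and fact.get("filter") == "Oracle_connections_exhausted"):
            # Rule 1: obtain the Oracle resource from the same host and
            # assert it as a new fact.
            value = fetch_resource(fact["host"],
                                   "Oracle-max_connections_configured")
            facts.append({"resource": "Oracle-max_connections_configured",
                          "host": fact["host"], "v": value})
        elif (fact["resource"] == "Oracle-max_connections_configured"
                and fact["v"] < 50):
            # Rule 2: fires only because rule 1 asserted the fact above.
            actions.append(("email", "dba", fact["host"]))
    return actions


exception = {"resource": "jvm-uncaught_exception",
             "filter": "Oracle_connections_exhausted",
             "host": "db1"}
low = diagnose(exception, lambda host, resource: 20)    # fewer than 50
high = diagnose(exception, lambda host, resource: 100)  # 50 or more
```

The user never states that the second rule follows the first; the link arises only because the first rule's action asserts the fact that the second rule's condition matches.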

[0101] FIG. 12 is a block diagram of the managed node 1200 according to one embodiment of the invention. The managed node 1200 is, for example, suitable for use as one or more of the managed nodes 102 illustrated in FIG. 1.

[0102] The managed node 1200 includes a plurality of different managed products 1202. In particular, the managed node 1200 includes managed products 1202-1, 1202-2, . . . , 1202-n. These managed products 1202 are software products that form part of the system being managed by a management system. The managed products can vary widely depending upon implementation. As examples, the managed products can pertain to a Solaris operating system, an Oracle database, or a JAVA application.

[0103] The managed node 1200 also includes an agent 1204. The agent 1204 couples to each of the managed products 1202. The agent 1204 also couples to a manager (e.g., the manager 108 illustrated in FIG. 1) via the management framework 106. In general, the agent 1204 can interact with the managed products 1202 such that the managed products 1202 can be monitored and possibly controlled by the management system via the agent 1204.

[0104] Additionally, in one embodiment, one or more of the managed products 1202 can include an application agent 1206. For example, as shown in FIG. 12, the managed product N 1202-n includes the application agent 1206. Here, the application agent 1206 resides within the process space of the managed product N 1202-n (and thus out of the process space of the agent 1204). The application agent 1206 can render the managed product N 1202-n more manageable by the agent 1204. For example, the application agent 1206 can enable any JAVA application to be managed. The capabilities of the application agent 1206 can be further enhanced by the user adding application code to the application agent 1206 that conforms to the Application Programming Interface (API) provided by the application agent 1206. This methodology provides a convenient means for the user to add application-specific information such that it becomes available as resources to the rest of the management system.

[0105] FIG. 13 is a block diagram of an agent 1300 according to one embodiment of the invention. The agent 1300 is, for example, suitable for use as the agent 1204 illustrated in FIG. 12.

[0106] The agent 1300 includes a master agent 1302 that couples to a plurality of sub-agents 1304. In particular, the agent 1300 utilizes N sub-agents 1304-1, 1304-2, . . . , 1304-n. Each of the sub-agents 1304-1, 1304-2, . . . , 1304-n respectively communicates with the managed products 1202-1, 1202-2, . . . , 1202-n shown in FIG. 12. The master agent 1302 thus interacts with the various managed products 1202 through the appropriate one of the sub-agents 1304. The master agent 1302 includes the resources that are shared by the sub-agents 1304. These shared resources are discussed in additional detail below with respect to FIG. 14. The master agent 1302 also provides an Application Programming Interface (API) that can be used by the user to write a sub-agent that can interact with a managed product for which a sub-agent is not provided by the management product. Using this API, the user-written sub-agent can make available the managed product specific information as resources to the rest of the management product, including the master agent 1302 and the manager 108.

[0107] The agent 1300 also includes a communication module 1306. The communication module 1306 allows the agent 1300 to communicate with a management framework (and thus a manager) through a variety of different protocols. In other words, the communication module 1306 allows the agent 1300 to interface with other portions of a management system over different protocol layers. These communication protocols can be standardized, general purpose protocols (such as SNMP), product-specific protocols (such as HPOV-SPI from Hewlett-Packard Company) or various other proprietary protocols. Hence, the communication module 1306 includes one or more protocol communication modules 1308. In particular, as illustrated in FIG. 13, the communication module 1306 can include protocol communication modules 1308-a, 1308-b, . . . , 1308-m. The protocol A communication module 1308-a interfaces with a communication network that utilizes protocol A. The protocol B communication module 1308-b interfaces with a communication network that utilizes protocol B. The protocol M communication module 1308-m interfaces with a communication network that utilizes protocol M.

[0108] FIG. 14 is a block diagram of a master agent 1400 according to one embodiment of the invention. The master agent 1400 is, for example, suitable for use as the master agent 1302 illustrated in FIG. 13.

[0109] The master agent 1400 includes a request processor 1402 that receives a request from the communication module 1306. The request is destined for one of the managed products 1202. Hence, the request processor 1402 operates to route an incoming request to the appropriate one of the sub-agents 1304 associated with the appropriate managed product 1202. Besides routing a request to the appropriate sub-agent 1304, the request processor 1402 can also perform additional operations, such as routing return responses from the sub-agents 1304 to the communication module 1306 (namely, the particular protocol communication module 1308 that is appropriate for use in returning the response to the balance of the management system, i.e., the manager).

[0110] The master agent 1400 typically includes a registry 1404 that stores registry data in a registry data store 1406. The registry 1404 manages lists which track the sub-agents 1304 that are available for use in processing requests for notification to the sub-agents 1304 or the protocol communication modules 1308. These lists that are maintained by the registry 1404 are stored as registry data in the registry data store 1406. Hence, the registry 1404 is the hub of the master agent 1400 for all traffic and interactions for other system components carried out at the agent 1300. The functionality provided by the registry 1404 includes (1) a mechanism for sub-agent registration, initialization, and dynamic configuration; (2) a communication framework for the sub-agent's interaction with the manager node through different communication modules present at the agent; (3) a notification mechanism for asynchronous notification delivery from the monitored systems and applications to the communication modules and the manager node; and (4) a sub-agent naming service so that sub-agents can be addressed by using simple, human-readable identifiers. The registry 1404 also acts as an interface between the communication modules 1308 and the registered sub-agents so that the communication modules 1308 are able to configure registered sub-agents and receive asynchronous notifications from the registered sub-agents.
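The routing of a request to a sub-agent by way of a naming registry can be sketched as follows. The sketch is in Python and is an illustration under assumptions: the class names `Registry` and `RequestProcessor`, the dictionary layout of a request, and the use of a plain callable as a sub-agent handler are hypothetical and are not drawn from the described embodiment.

```python
class Registry:
    """Maps simple, human-readable sub-agent names to request handlers."""

    def __init__(self):
        self._sub_agents = {}

    def register(self, name, handler):
        """Register a sub-agent under a name so requests can address it."""
        self._sub_agents[name] = handler

    def lookup(self, name):
        """Return the handler for a name, or None if it is not registered."""
        return self._sub_agents.get(name)


class RequestProcessor:
    """Routes an incoming request to the sub-agent named in the request."""

    def __init__(self, registry):
        self._registry = registry

    def handle(self, request):
        handler = self._registry.lookup(request["sub_agent"])
        if handler is None:
            return {"status": "error", "reason": "unknown sub-agent"}
        return {"status": "ok", "value": handler(request["resource"])}


registry = Registry()
# A stand-in sub-agent that answers from a fixed table of resource values.
registry.register("jvm", lambda resource: {"heap_used": 3166032}.get(resource))
processor = RequestProcessor(registry)

ok = processor.handle({"sub_agent": "jvm", "resource": "heap_used"})
missing = processor.handle({"sub_agent": "oracle", "resource": "sessions"})
```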

[0111] The master agent 1400 also includes a scheduler 1408 and a statistical analyzer 1410. The scheduler 1408 can be utilized to schedule requests in the future to be processed by the request processor 1402. The statistical analyzer 1410 can be utilized to process (or at least pre-process) the response data being returned from the managed product 1202 before some or all data is returned to the manager. Hence, by having the master agent 1400 perform certain statistical analysis at the statistical analyzer 1410, the processing load on the manager can be distributed to the master agents.

[0112] Each of the sub-agents 1304 can be a pluggable component enclosing monitoring and control functionality pertinent to a single system or application. The sub-agents 1304 are known to the managed products through the registry 1404. In other words, each of the sub-agents 1304 is registered and initialized by the registry 1404 before it can receive requests and send out information about the managed product it monitors. The principal task of the sub-agent 1304 is to interact with the managed product (e.g., system/application) it controls or monitors. The sub-agent 1304 serves to hide much interaction detail from the rest of the agent 1300 and provides only a few entry points for requests for the information.

[0113] The different protocols supported by the communication module 1306 allow the communication module 1306 to be dynamically extended to support additional protocols. As a particular protocol communication module 1308 is initialized, the registry 1404 within the master agent 1400 is informed of the particular protocol communication module 1308 so that asynchronous notifications from the managed objects can be received and passed to the manager via the particular protocol communication module 1308.

[0114] The communication module 1306 receives requests from a manager through the protocol supported by the particular protocol communication module 1308 that implements it, and forwards such requests to the appropriate sub-agent 1304 corresponding to the appropriate managed node. The registry 1404 within the master agent 1400 is utilized to forward the request from the protocol communication module 1308 to the sub-agents 1304.

[0115] In addition, the protocol communication module 1308 also provides a callback for the sub-agents 1304 such that notifications are able to be received from the managed product and sent back to the manager. If such callbacks are not provided, the notifications will be ignored by the sub-agents 1304 and, thus, no error will be reported to the manager. Hence, each of the protocol communication modules 1308 can be configured to handle or not handle notifications as desired by any particular implementation.
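The callback arrangement described above, in which a notification reaches the manager only through a protocol communication module that has provided a callback, can be sketched as follows. This Python sketch is illustrative only; the classes `SubAgent` and `ProtocolModule`, the `handles_notifications` flag, and the list standing in for transmission to the manager are hypothetical.

```python
class SubAgent:
    """Emits events to whatever callbacks have been provided (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self._callbacks = []

    def add_callback(self, callback):
        self._callbacks.append(callback)

    def emit(self, event):
        """Deliver an event to every callback; return how many received it.

        With no callbacks registered the event is silently dropped.
        """
        for callback in self._callbacks:
            callback(self.name, event)
        return len(self._callbacks)


class ProtocolModule:
    """A protocol communication module that may or may not take notifications."""

    def __init__(self, handles_notifications):
        self.handles_notifications = handles_notifications
        self.sent_to_manager = []  # stands in for transmission to the manager

    def attach(self, sub_agent):
        """Provide a callback only if this module is configured to handle them."""
        if self.handles_notifications:
            sub_agent.add_callback(self._forward)

    def _forward(self, source, event):
        self.sent_to_manager.append((source, event))


snmp = ProtocolModule(handles_notifications=True)
quiet = ProtocolModule(handles_notifications=False)
jvm_agent = SubAgent("jvm")
snmp.attach(jvm_agent)
quiet.attach(jvm_agent)
delivered = jvm_agent.emit("crash")

orphan = SubAgent("oracle")        # no module provided a callback
dropped = orphan.emit("restart")   # ignored without raising an error
```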

[0116] FIG. 15 is a block diagram of a sub-agent 1500 according to one embodiment of the invention. The sub-agent 1500 is, for example, suitable for use as any of the sub-agents 1304 illustrated in FIG. 13.

[0117] The sub-agent 1500 includes a get resource module 1502, a set operation module 1504, and an event forwarding module 1506. The get resource module 1502 interacts with a managed product to obtain resources being monitored by the managed product. The set operation module 1504 interacts with the managed product to set or control its operation. The event forwarding module 1506 operates to forward events that have occurred on the managed product to the manager. In addition, the sub-agent 1500 can further include a statistical analyzer 1508. The statistical analyzer 1508 can operate to perform statistical processing on raw data provided by a managed product at the sub-agent level. Hence, although the master agent 1400 may include the statistical analyzer 1410, the presence of the statistical analyzer 1508 in each of the sub-agents 1500 allows further distribution of the processing load for statistical analysis of raw data.
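One way to picture the load distribution provided by a statistical analyzer at the agent or sub-agent is that raw samples are reduced to a short summary before anything is returned to the manager. The Python sketch below assumes a particular set of statistics (count, minimum, maximum, mean); the specification does not state which statistics are computed, and the function name `summarize` is hypothetical.

```python
from statistics import mean


def summarize(samples):
    """Reduce raw samples to a small summary; return None for no samples."""
    if not samples:
        return None
    return {
        "count": len(samples),
        "min": min(samples),
        "max": max(samples),
        "mean": mean(samples),
    }


raw_heap_samples = [100, 200, 300, 400]  # illustrative values only
summary = summarize(raw_heap_samples)    # four fields instead of every sample
```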

[0118] FIGS. 16A and 16B are flow diagrams of manager startup processing 1600 according to one embodiment of the invention. The manager startup processing 1600 initially loads 1602 a knowledge base. The manager is, for example, the manager 200 illustrated in FIG. 2 and includes a knowledge base, such as the knowledge base 206 illustrated in FIG. 2. Once the knowledge base is loaded 1602, third-party management frameworks are discovered 1604. In one implementation, a management framework interface, such as the management framework interface 212 illustrated in FIG. 2, is utilized to identify and establish an interface to all available third-party management frameworks. Next, a list of node groups is obtained 1606. In one implementation, the list of node groups is retrieved by the management framework interface.

[0119] Next, a first node group is selected 1608 from the list of node groups. For the selected node group, a list of nodes within the selected node group is obtained 1610. A decision 1612 then determines whether there are more node groups to be processed. When the decision 1612 determines that there are more node groups to be processed, then the manager startup processing 1600 returns to repeat the operations 1608 and 1610 for a next node group. When the decision 1612 determines that there are no more node groups to be processed, all the nodes within each of the node groups have thus been obtained.

[0120] At this point, processing is performed on each of the nodes. A first node from the various nodes that have been obtained is selected 1614. Then, a list of domains within the selected node is obtained 1616. A decision 1618 then determines whether there are more nodes to be processed. When the decision 1618 determines that there are more nodes to be processed, then the manager startup processing 1600 returns to repeat the operations 1614 and 1616 for a next node.

[0121] On the other hand, when the decision 1618 determines that there are no more nodes to be processed, the manager startup processing 1600 performs processing on each of the domains that have been obtained. In this regard, a first domain is selected 1620. Then, a list of supported resources is obtained 1622 for the selected domain. A decision 1624 then determines whether all of the domains that have been identified have been processed. When the decision 1624 determines that there are additional domains to be processed, the manager startup processing 1600 returns to repeat the operations 1620 and 1622 for a next domain such that each domain can be similarly processed.

[0122] Next, processing is performed with respect to each of the nodes. At this point, a first node is selected 1626. Then, a customized knowledge base is produced 1628 for the selected node based on the supported resources for the selected node. In other words, the generalized knowledge base that is loaded 1602 is customized at operation 1628 such that a customized knowledge base is provided for each node that is active or present within the system being managed. A decision 1630 then determines whether there are more nodes to be processed. When the decision 1630 determines that there are more nodes to be processed, then the manager startup processing 1600 returns to repeat the operations 1626 and 1628 for a next node. Alternatively, when the decision 1630 determines that there are no more nodes to be processed, then data acquisition for those base rules within the customized knowledge bases can be scheduled 1632. Once the data acquisition has been scheduled 1632, the manager startup processing 1600 is complete and ends.
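The sequence of operations 1606 through 1630 can be condensed into the following Python sketch. It assumes that customizing the knowledge base for a node means keeping those rules whose required resources are all supported by the domains on that node; that criterion, the function name `customize`, and the dictionary-based data layout are assumptions made for illustration and are not stated in the specification.

```python
def customize(knowledge_base, node_groups, node_domains, domain_resources):
    """Produce a per-node list of applicable rules.

    knowledge_base:   rule name -> set of resources the rule needs
    node_groups:      group name -> list of node names
    node_domains:     node name -> list of domain names
    domain_resources: domain name -> set of supported resource names
    """
    # Operations 1606-1612: collect the nodes of every node group.
    nodes = [node for group in node_groups.values() for node in group]

    custom = {}
    for node in nodes:  # operations 1626-1630: one pass per node
        # Operations 1614-1624: domains on the node and their resources.
        supported = set()
        for domain in node_domains.get(node, []):
            supported |= domain_resources.get(domain, set())
        # Operation 1628 (assumed criterion): keep rules whose required
        # resources are all supported on this node.
        custom[node] = sorted(rule for rule, needed in knowledge_base.items()
                              if needed <= supported)
    return custom


KB = {
    "heap-high": {"jvm_heapused"},
    "conn-low": {"jvm_uncaught_exception", "oracle_max_connections"},
}
GROUPS = {"prod": ["n1", "n2"]}
NODE_DOMAINS = {"n1": ["jvm"], "n2": ["jvm", "oracle"]}
DOMAIN_RESOURCES = {
    "jvm": {"jvm_heapused", "jvm_uncaught_exception"},
    "oracle": {"oracle_max_connections"},
}

per_node = customize(KB, GROUPS, NODE_DOMAINS, DOMAIN_RESOURCES)
```

In this example the node without an Oracle domain receives only the JVM rule, while the node with both domains also receives the rule that spans JVM and Oracle resources.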

[0123] FIGS. 16C-16E are flow diagrams of manager startup processing 1650 according to another embodiment of the invention. The manager startup processing 1650 initially loads 1652 a knowledge base with resources, rule packs and configuration information. The manager is, for example, the manager 200 illustrated in FIG. 2 and includes a knowledge base, such as the knowledge base 206 illustrated in FIG. 2. Once the knowledge base is loaded 1652, a list of node groups is obtained 1654.

[0124] A decision 1656 then determines whether there are any node groups to be processed. When the decision 1656 determines that there are node groups to be processed, then a first node group is selected 1658. Then, a list of nodes within the selected node group is obtained 1660.

[0125] Next, a decision 1662 determines whether there are any nodes in the selected node group that are to be processed. When the decision 1662 determines that there are nodes within the selected node group to be processed, then a first node is selected 1664. Then, for the selected node, a list of agent types on the selected node is obtained 1668.

[0126] A decision 1670 then determines whether there are any agent types to be processed. When the decision 1670 determines that there are agent types to be processed, a first agent type is selected 1671. Then, for the selected agent type, a decision 1672 determines whether there is any third party framework adapter. When the decision 1672 determines that there is no third party framework adapter, then a list of domains is obtained 1674. On the other hand, when the decision 1672 determines that there is a third party framework adapter, then a list of supported domains is discovered 1676. Here, the resulting list of supported domains includes information about product(s) supported by the third party adapter. The concept of domain in this case is adapter-specific. For example, for an SNMP adapter, all resources supported by the SNMP master agent on a managed node can be considered as belonging to a domain. Another concept of domain for an SNMP adapter can correspond to the resources supported by every SNMP sub-agent on the managed node communicating with the SNMP master agent.

[0127] Following the operations 1674 and 1676, a decision 1678 determines whether there are any domains within the selected agent type. When the decision 1678 determines that there are domains, then a first domain is selected 1680. Then, a list of supported resources and a domain version are obtained 1682. Next, a decision 1684 determines whether there are more domains within the selected agent type. When the decision 1684 determines that there are more domains, then the manager startup processing 1650 returns to repeat the operation 1680 and subsequent operations so that a next domain can be similarly processed.

[0128] Alternatively, when the decision 1684 determines that there are no more domains within the selected agent type to be processed, as well as directly following the decision 1678 when there are no domains to be processed, a decision 1686 determines whether there are more agent types to be processed. When the decision 1686 determines that there are more agent types to be processed, then the manager startup processing 1650 returns to repeat the operation 1671 and subsequent operations so that a next agent type can be similarly processed.

[0129] On the other hand, when the decision 1686 determines that there are no more agent types to be processed, or directly following the decision 1670 when there are no agent types to be processed, a decision 1688 determines whether there are more nodes to be processed. When the decision 1688 determines that there are more nodes to be processed, then the manager startup processing 1650 returns to repeat the operation 1664 and subsequent operations so that a next node can be similarly processed.

[0130] Alternatively, when the decision 1688 determines that there are no more nodes to be processed, or directly following the decision 1662 when there are no nodes, a decision 1690 determines whether there are more node groups to be processed. When the decision 1690 determines that there are more node groups to be processed, the manager startup processing 1650 returns to repeat the operation 1658 and subsequent operations so that a next node group can be similarly processed.

[0131] On the other hand, when the decision 1690 determines that there are no more node groups to be processed, or directly following the decision 1656 when there are no node groups, a customized domain and resources list is produced 1692 based on available domains (and their versions) and resources information for rules input. Then, a customized knowledge base is produced 1694 for the selected nodes based on supported domains and resources.

[0132] A reference resource list can be created using the most up-to-date version of each domain type. The reference resource list is used in rule definitions. For example, a JVM domain list of resources obtained from one managed node may be larger in number than the list of resources obtained for the JVM domain from a different managed node. This is possible because of enhancement of agent 1204 over time. The reference resource list contains the maximal set of domains and resources from the latest version of all the knowledge domains by name/type. This enables a user to define rules for the most complete manageability of the user environment 100 (e.g., using one GUI).
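As a minimal sketch of how such a reference resource list might be assembled, the following example keeps, for each domain, the resource list reported with the highest version. The tuple-based version numbers, the function name build_reference_list and the sample data are assumptions made for illustration and do not appear in the specification.

```python
from typing import Dict, List, Tuple

# (domain name, version, resources) as reported by a managed node; hypothetical shape.
Report = Tuple[str, Tuple[int, ...], List[str]]


def build_reference_list(reports: List[Report]) -> Dict[str, List[str]]:
    """For each domain, keep the resources of the most up-to-date version seen."""
    latest: Dict[str, Tuple[Tuple[int, ...], List[str]]] = {}
    for domain, version, resources in reports:
        if domain not in latest or version > latest[domain][0]:
            latest[domain] = (version, sorted(resources))
    return {domain: resources for domain, (_, resources) in latest.items()}


reports = [
    ("jvm", (1, 0), ["jvm_HeapUsed"]),                     # older agent
    ("jvm", (1, 2), ["jvm_HeapUsed", "jvm_MaxHeapSize"]),  # enhanced agent
    ("db", (2, 0), ["db_OpenConnections"]),
]
reference = build_reference_list(reports)
```

Here the older JVM agent reports fewer resources, so the reference list for the jvm domain is taken from the newer version.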

[0133] Next, a decision 1696 determines whether a knowledge processor has been selected to run. The decision 1696 enables a user to start the management system for development and testing of rules and also to set up all the managed nodes and select a set of rule packs and rules prior to running the knowledge processor. The decision 1696 can be facilitated by a GUI. When the decision 1696 determines that the knowledge processor is to be run, then data acquisition for those base rules within the customized knowledge base can be scheduled 1698. Alternatively, when the decision 1696 determines that the knowledge processor is not selected to run, then the operation 1698 can be bypassed. Following the operation 1698, or its being bypassed, the manager startup processing 1650 is complete and ends.

[0134] FIG. 17A is a flow diagram of master agent startup processing 1700 according to one embodiment of the invention. A managed node includes an agent to assist the management system in monitoring and managing the managed node. In one embodiment, the agent includes a master agent and a plurality of sub-agents. Hence, the master agent startup processing 1700 pertains to startup processing that is performed by a master agent. The master agent is, for example, the master agent 1302 illustrated in FIG. 13.

[0135] The master agent startup processing 1700 initializes 1702 any pre-configured sub-agents for the master agent. Hence, any standard sub-agents for the master agent are initialized 1702. Then, the presence of any other sub-agents for the master agent is discovered 1704. These other sub-agents can be either in-process or out-of-process. An in-process sub-agent would operate in the same process as the master agent. On the other hand, an out-of-process sub-agent would operate in a separate process from that of the master agent. After any other sub-agents are discovered 1704, the discovered sub-agents are initialized 1706. A statistical analyzer can then be activated 1708 for each of the sub-agents. The statistical analyzers provide the statistics collection for the resources being monitored by the respective sub-agents. Following the operation 1708, the master agent startup processing 1700 is complete and ends.

[0136] FIG. 17B is a flow diagram of sub-agent startup processing 1750 according to one embodiment of the invention. The sub-agent startup processing 1750 is performed by a sub-agent. For example, the sub-agent can be one of the sub-agents 1304 illustrated in FIG. 13.

[0137] The sub-agent startup processing 1750 initially establishes 1752 a connection with the master agent. The connection is an interface or a communication link between the master agent and the sub-agent. Application resources are then discovered 1754. The application resources are those resources that are available from an application monitored by the sub-agent. The application resources can also include user-defined resources, e.g., using an API. Next, the master agent is notified 1756 of the status of the sub-agent. The status for the sub-agent can include various types of information. For example, the status of the sub-agent might include the resources that are available from the sub-agent, details about the version or operability of the sub-agent, etc. Next, a statistical analyzer can be activated 1758 for the sub-agent. The statistical analyzer allows the sub-agent to perform statistical analysis on resource information available from the sub-agent. Following the operation 1758, the sub-agent startup processing 1750 is complete and ends. It should, however, be recognized that the sub-agent startup processing 1750 is performed for each of the sub-agents associated with the master agent.

[0138] FIGS. 18A and 18B are flow diagrams of trigger/notification processing 1800 according to one embodiment of the invention. The trigger/notification processing 1800 is, for example, performed by a manager, such as the manager 108 illustrated in FIG. 1. In particular, the trigger/notification processing 1800 operates to trigger processing so that management information can be recorded and utilized, including initiation of notifications as appropriate.

[0139] The trigger/notification processing 1800 begins with a decision 1802 that determines whether a new fact has been asserted. When the decision 1802 determines that a new fact has not been asserted, then a decision 1804 determines whether a notification has been received. Here, the notifications could arrive from managed nodes. When the decision 1804 determines that a notification has not been received, then the trigger/notification processing 1800 returns to repeat the decision 1802. Once the decision 1802 determines that a new fact has been asserted or when the decision 1804 determines that a notification has been received, then a fact is asserted 1806 in the inference engine. The inference engine then processes the fact in the manager. For example, in the case of the manager 200 illustrated in FIG. 2, the inference engine is implemented by the knowledge processor 208. Next, a log entry is made 1808 into a log. The log entry indicates at least that the fact was asserted 1806.

[0140] Next, updated facts are retrieved 1810 for one or more rules that are dependent upon the asserted fact. Hence, the inference engine receives the asserted fact and determines which of the rules are dependent upon the asserted fact, and then for such rules, requests updated facts so that the rules can be fully and completely processed using up-to-date information.
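The dependency lookup described for operation 1810 can be illustrated with the following sketch. It assumes a simple mapping from each rule to the facts it reads; the mapping RULE_INPUTS, the function facts_to_refresh and the DBPool rule are hypothetical and are not taken from the specification.

```python
from typing import Dict, List, Set

# Hypothetical mapping: rule name -> facts (resources) the rule reads.
RULE_INPUTS: Dict[str, Set[str]] = {
    "JVMHeap": {"jvm_HeapUsed", "jvm_MaxHeapSize"},
    "DBPool": {"db_OpenConnections"},
}


def facts_to_refresh(asserted: str, rule_inputs: Dict[str, Set[str]]) -> List[str]:
    """Return the other facts needed by every rule that depends on the asserted fact."""
    needed: Set[str] = set()
    for inputs in rule_inputs.values():
        if asserted in inputs:              # this rule depends on the asserted fact
            needed |= inputs - {asserted}   # request its remaining inputs
    return sorted(needed)
```

For example, asserting jvm_HeapUsed would cause an updated value of jvm_MaxHeapSize to be requested, so that the JVMHeap rule is evaluated with current information.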

[0141] Following the operation 1810, a decision 1812 determines whether the trigger/notification processing 1800 should stop. When the decision 1812 determines that the trigger/notification processing 1800 should stop, then those facts no longer needed are discarded 1813. Following the operation 1813, the trigger/notification processing 1800 is complete and ends. For example, a user might terminate the operation of the manager and thus end the trigger/notification processing 1800.

[0142] Alternatively, when the decision 1812 determines that the trigger/notification processing 1800 should not stop, then additional processing is performed depending upon the type of resource. For example, the resource or the rule being processed can signal for data acquisition, corrective action or debug operations. In particular, a decision 1814 determines whether data acquisition is requested. When the decision 1814 determines that data acquisition has been requested, then an updated fact is selected 1816. On the other hand, when the decision 1814 determines that data acquisition is not being requested, then a decision 1818 determines whether corrective action is indicated. For example, a rule within the knowledge base can request a corrective action be performed. In any case, when the decision 1818 determines that a corrective action has been requested, then the corrective action is performed 1820.

[0143] Alternatively, when the decision 1818 determines that a corrective action is not being requested, then a decision 1822 determines whether debug data is being requested. When the decision 1822 determines that debug data is requested, then debug data is obtained 1824.

[0144] Alternatively, when the decision 1822 determines that debug data is not being requested, then a decision 1828 determines whether a user-defined situation has occurred. When the decision 1828 determines that a user-defined situation has occurred, then an action 1830 is taken noting the occurrence of the user-defined situation.

[0145] Following any of the operations 1816, 1820, 1824, 1830 or the decision 1828 when a user-defined situation is not present, a log entry is made 1826 into the log. The log entry indicates the firing of the rule along with the specifics of the resources (including their values) on the left-hand side (or “if” part) of the rule. Following the logging operation 1826, the trigger/notification processing 1800 returns to repeat the operation 1806 and subsequent operations so that additional facts can be asserted and similarly processed.
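The sequence of decisions 1814, 1818, 1822 and 1828, followed by the logging operation 1826, can be sketched as a simple dispatch. The flag names in the request dictionary and the returned branch labels are hypothetical; only the ordering of the decisions is taken from the description above.

```python
from typing import Dict, List


def dispatch(request: Dict[str, bool], log: List[str]) -> str:
    """Walk the decisions in order and log the branch that was taken."""
    if request.get("data_acquisition"):            # decision 1814
        branch = "select_updated_fact"             # operation 1816
    elif request.get("corrective_action"):         # decision 1818
        branch = "perform_corrective_action"       # operation 1820
    elif request.get("debug"):                     # decision 1822
        branch = "obtain_debug_data"               # operation 1824
    elif request.get("user_defined_situation"):    # decision 1828
        branch = "note_situation"                  # action 1830
    else:
        branch = "none"
    log.append(branch)                             # log entry 1826
    return branch
```

Because the decisions are evaluated in sequence, a request that signals both data acquisition and debug data follows the data acquisition branch in this sketch.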

[0146] Additionally, a user of the management system may interact with a Graphical User Interface (GUI) to request a report. The report provides information to the user about the management state of the one or more managed products within the enterprise or computer system being monitored.

[0147] FIG. 19 is a flow diagram of GUI report processing 1900 according to one embodiment of the invention. The GUI report processing 1900 is, for example, performed by a manager. For example, the manager can be the manager 200 illustrated in FIG. 2.

[0148] The GUI report processing 1900 can begin with a decision 1902 that determines whether a report has been requested. When the decision 1902 determines that a report has not yet been requested, the GUI report processing 1900 awaits such a request. In other words, the GUI report processing 1900 can be considered to be invoked once a report request has been received. In any case, when the decision 1902 determines that a report request has been received, then log data is retrieved 1904. For example, with respect to the manager 200 illustrated in FIG. 2, the log data can be retrieved 1904 from the log module 220. After the log data is retrieved 1904, a report is generated 1906 from the retrieved log data.

[0149] The report might indicate the various facts and rules that have been utilized by the management system over a period of time. For example, a report might specify those of the rules that were “fired” and, for each such rule, when it “fired,” why it “fired,” and the action (if any) taken. Additionally, a report might include details on the actions taken and related values. Still further, if one of the actions taken is a debug action, then the report might also include debug data. A report can also be targeted or selective in its content based on criteria. For example, a report can be limited with respect to one or more of a certain time range, an event, exceptions, domains and/or rule packs.
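A selective report of this kind can be sketched as a filter over log entries. The LogEntry fields, the generate_report function, the report line format and the sample entries are assumptions made for illustration; the specification does not define the layout of the log data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LogEntry:
    timestamp: float   # seconds since an arbitrary start (assumed unit)
    rule: str
    rule_pack: str
    action: str        # empty string when no action was taken


def generate_report(entries: List[LogEntry],
                    start: Optional[float] = None,
                    end: Optional[float] = None,
                    rule_pack: Optional[str] = None) -> List[str]:
    """Build report lines, limited by an optional time range and rule pack."""
    lines = []
    for entry in entries:
        if start is not None and entry.timestamp < start:
            continue
        if end is not None and entry.timestamp > end:
            continue
        if rule_pack is not None and entry.rule_pack != rule_pack:
            continue
        lines.append(f"{entry.timestamp:.0f} {entry.rule} fired; "
                     f"action: {entry.action or 'none'}")
    return lines


log = [
    LogEntry(10, "JVMHeap", "jvm", "jvm_TopHeapObjects"),
    LogEntry(20, "DBPool", "db", ""),
    LogEntry(30, "JVMHeap", "jvm", "jvm_AllocTrace"),
]
```

Calling generate_report(log, rule_pack="jvm") would list only the two JVMHeap firings, while passing a time range restricts the report to entries within that range.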

[0150] Once the report has been generated 1906, a report delivery method is determined 1908. Here, the report delivery method can be pre-configured by an administrator of the management system to deliver reports to certain individuals or locations automatically. For example, the report can be delivered in the form of a notification that can be carried out using a pager, a voice mail, a voice synthesized telephone call, a facsimile, etc. Once the report delivery method has been determined 1908, the report is delivered 1910 using the determined report delivery method. It should be understood that the report delivery method can vary depending upon the nature of the report. For example, urgent reports can utilize one or more delivery methods that are more likely to reach the recipient immediately, such as a page or a mobile telephone call. Hence, the report can be delivered in a variety of different ways depending upon the application, circumstances and configuration of the management system. Following the delivery 1910 of the report, the GUI report processing 1900 is complete and ends.

[0151] FIGS. 20-29 are screen shots of a representative Graphical User Interface (GUI) suitable for use with one embodiment of the present invention. These screen shots detail how to create and maintain rules using the GUI.

[0152] How to Build a Rule Using Resources

[0153] To add (create) a rule, a user would access an Add New Rule page, such as shown in FIG. 20. Here, the user would perform the first of four steps to follow in order to add a new rule. Namely, the user would enter a name and description for the rule and select the rule pack it belongs to. Upon pressing a Submit button, the process proceeds to the next step, where the user defines the situation or the left-hand side of a rule, i.e., the conditions under which the rule will fire. In other words, the left-hand side is a list of situations and events (“When this happens . . . ”) which lead to the actions specified under the “Then define situation or do this . . . ” header, which is referred to as the right-hand side of the rule. Predicates of the left-hand side are called antecedents and elements of the right-hand side are called consequents.

[0154] As shown in FIG. 21, to build the left-hand side of a rule, first choose a knowledge domain from a Domains list on the left side of the screen. After a domain is selected from the list, the selection box below will show all resources of that domain. There are two kinds of domains, physical and special (or virtual). A physical domain represents a collection of resources pertaining to a software component or an entire software product, for instance the Java Virtual Machine, as opposed to a special, or virtual, domain. A special domain represents a set of resources which are not associated with any “physical” knowledge domain. Instead, such resources are used by the manager as building blocks to express conditions of the left-hand side or form actions on the right-hand side of a rule. In the representative rule being built, both a physical domain resource and a virtual domain resource are used. First, select the jvm domain from the list of domains and add two resources of that domain to the right-hand side of the rule (see FIG. 21).

[0155] Once we have selected all the resources used to define the situation, the “proceed to next step” button is selected. The next step is where relationships between the selected resources and/or their thresholds are set to configure the condition for the rule to fire. Now, add a condition to the left-hand side of the rule. This condition basically states that when the amount of heap memory currently in use is greater than a certain percentage of the maximum heap memory available, the rule should fire. In order to add a condition to the left-hand side of a rule, choose the Filter special domain. As shown in FIG. 22, one of the domain resources in the selection box will be Condition. The user just selects “Condition” and clicks the add button.

[0156] Next, an Edit Parameter button for the condition is selected and the desired condition expression entered. Here, the condition expression shown entered in FIG. 23 binds the two JVM resources. The condition is typically defined as an expression. A simple example of a condition expression is (a>b).

[0157] Let us look in detail at how we came up with the condition expression in FIG. 23. Please refer to FIG. 22 for better illustration. Under the “When this happens . . . ” header, note that there are three distinct entries, one below the other, as follows:

?r1 jvm_HeapUsed
?r2 jvm_MaxHeapSize
?r3 Condition

[0158] Here ?r1, ?r2 and ?r3 are resource variable names assigned by the system to the resources jvm_HeapUsed, jvm_MaxHeapSize and Condition, respectively. This is to facilitate the definition of the condition expression using the resource variable names only. A simplified example of a condition expression using resource ?r1 is (?r1>1000000), which states that the rule is considered true (or gets “fired”) in case jvm_HeapUsed exceeds 1000000 bytes or 1 MB. Note that, in this expression, ?r1 and 1000000 are operands and > is a comparison operator between the two operands.

[0159] In the condition expression ?r1>(?r2*0.60) in FIG. 23, the condition states that the rule is considered to be true if the JVM heap currently being used (jvm_HeapUsed), ?r1, is greater than 60% of (or 0.60 times) the maximum allowed heap size (jvm_MaxHeapSize), ?r2.
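To make the evaluation of such a condition expression concrete, the following sketch binds the resource variables to sample values and evaluates the expression. It is an illustration under stated assumptions, not the evaluation mechanism of the described rule engine: the function evaluate_condition is hypothetical, only arithmetic and single comparison operators are supported, and the Python ast module is used merely as a convenient expression parser.

```python
import ast
import operator
from typing import Dict

# Operators supported by this sketch.
_OPS = {
    ast.Gt: operator.gt, ast.Lt: operator.lt,
    ast.GtE: operator.ge, ast.LtE: operator.le,
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
}


def _eval(node, values):
    """Recursively evaluate a parsed expression against bound resource values."""
    if isinstance(node, ast.Expression):
        return _eval(node.body, values)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name):
        return values[node.id]
    if isinstance(node, ast.BinOp):
        return _OPS[type(node.op)](_eval(node.left, values),
                                   _eval(node.right, values))
    if isinstance(node, ast.Compare) and len(node.ops) == 1:
        return _OPS[type(node.ops[0])](_eval(node.left, values),
                                       _eval(node.comparators[0], values))
    raise ValueError("unsupported expression")


def evaluate_condition(expression: str, bindings: Dict[str, float]) -> bool:
    """Evaluate e.g. '?r1>(?r2*0.60)' with bindings such as {'?r1': 700}."""
    # '?r1' is not a valid Python name, so the leading '?' is dropped first.
    tree = ast.parse(expression.replace("?", ""), mode="eval")
    values = {name.lstrip("?"): value for name, value in bindings.items()}
    return bool(_eval(tree, values))
```

With jvm_HeapUsed bound to 700 and jvm_MaxHeapSize bound to 1000, the expression ?r1>(?r2*0.60) holds because 700 exceeds 600; with jvm_HeapUsed bound to 500 it does not.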

[0160] Now that the left-hand side of the rule has been built, let us use the Configure Action(s) page shown in FIG. 24 to indicate what we want the system to do when the condition becomes true. Let us request that the system produce a report on the classes whose objects occupy most of the JVM heap, and request a report on objects of the classes thus identified that are allocated on the heap during the following 15 seconds.

[0161] Setting Up a Rule for Auto-Diagnostics

[0162] In order to test the rule that has been created (and also make sure that all components of the products are installed properly and communicate with each other), the manager should be set so that it considers the rule when the rule evaluation engine is started. Every rule can be configured in a flexible way. For instance, it can be set to be tested every 10 seconds, or every minute, or every hour. If a trial run of the rule is wanted when the engine is run, a special option on the list of possible intervals, “once only,” can be chosen. The testing interval can be set on the same Rule Editing page as shown in FIG. 25.

[0163] Chaining of Rules

[0164] The rule shown in FIG. 24 is a rule that defines conditions for an abnormal situation. If the defined situation occurs, the system is requested to take one or more actions. In this representative example, the actions are the two requests for jvm_TopHeapObjects and jvm_AllocTrace on the right-hand side of the rule, under the “Then define situation or do this . . . ” header. This kind of rule is useful, but its capabilities are limited. If, instead of taking action right there in the rule, a situation is defined, then another rule can be built so that it gets triggered when this situation has been encountered. Through this mechanism, rules can be chained and hierarchies, or trees, of rules can be built.

[0165] For example, for this rule to be turned into a rule that can potentially be chained to other rules, a new situation has to be defined, see FIG. 26. The situation can then be added to the rule as a consequent, see FIG. 27.

[0166] Thereafter, as desired, another rule or a set of rules can be defined with JVMLowMemory as the antecedent and the system will automatically chain these rules, i.e., the set of rules defined with JVMLowMemory on the left-hand side of the rule will fire when the situation in FIG. 27 is declared in the modified rule in FIG. 24.
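The chaining mechanism can be sketched as a small forward-chaining loop in which a rule's consequent situation becomes available as an antecedent for other rules. The rule tuples, the fact name heap_over_threshold, the rule name CollectHeapReport and the function run_chain are hypothetical; only the JVMHeap rule and the JVMLowMemory situation come from the example above.

```python
from typing import List, Optional, Set, Tuple

# (rule name, antecedents, situation declared as consequent or None)
Rule = Tuple[str, Set[str], Optional[str]]


def run_chain(rules: List[Rule], facts: Set[str]) -> List[str]:
    """Fire rules until no new rule can fire; return rule names in firing order."""
    known = set(facts)
    fired: List[str] = []
    progress = True
    while progress:
        progress = False
        for name, antecedents, situation in rules:
            if name not in fired and antecedents <= known:
                fired.append(name)
                if situation:
                    known.add(situation)   # declared situation can trigger other rules
                progress = True
    return fired


rules = [
    ("CollectHeapReport", {"JVMLowMemory"}, None),         # chained rule
    ("JVMHeap", {"heap_over_threshold"}, "JVMLowMemory"),  # declares the situation
]
```

When the heap condition holds, JVMHeap fires first and declares JVMLowMemory, after which the chained rule fires regardless of the order in which the rules were defined.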

[0167] Editing Rules

[0168] A previously defined (added) rule can be edited. To edit an existing rule, go to the Rule Management page, such as shown in FIG. 28, select an existing rule and click on the Edit button.

[0169] Starting and Stopping the Rule Engine

[0170] After a rule or a chain of rules has been created, the system is ready to monitor the software on the managed nodes. In order to initiate this process, from the Rule Management page, start the rule engine by clicking on the (Re)Start Engine button. If the rule engine has to be stopped, press the Stop Engine button in the Rule Management page. If any of the rules were edited or new rules were added and you want these changes to take effect, the (Re)Start Engine button in the Rule Management page has to be pressed. This will cause the engine to stop, automatically pick up any changes that have been made, and restart.

[0171] Note that every time the manager process is started, the Rule Engine status can be Ready. The current status of the engine is displayed in the top right-hand corner of the Rule Management page. For the rules to be fired according to the time and condition set in their definitions, the (Re)Start Engine button in the Rule Management page needs to be pressed explicitly. This changes the status of the Rule Engine from Ready to Running. You have to do this every time you add or make changes to rules and want the Rule Engine to pick up the additions/changes. As the engine gets into the running state, it checks resource values of the rules set up for periodic checking. In case all conditions on the left-hand side of such a rule become valid, the engine will proceed with the actions on the right-hand side of the rule, after which the rule will become blocked for as long as the conditions are valid. Once the conditions are no longer valid, the rule will be marked active again. All activities of the engine with respect to rule firing and subsequent actions are reflected on the Report page. The page can be accessed through the Report button on the Rule Management page, such as shown in FIG. 28.
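The blocking behavior described for periodically checked rules can be sketched as follows. The class PeriodicRule and its check method are hypothetical, and the sketch assumes that a rule is re-activated at the first periodic check at which its conditions no longer hold.

```python
from typing import Callable, Dict


class PeriodicRule:
    """Fires once when its condition becomes valid, then stays blocked while it holds."""

    def __init__(self, condition: Callable[[Dict[str, float]], bool]):
        self.condition = condition
        self.blocked = False

    def check(self, values: Dict[str, float]) -> bool:
        """Return True only for the periodic check at which the rule fires."""
        valid = self.condition(values)
        if valid and not self.blocked:
            self.blocked = True    # block until the condition clears
            return True
        if not valid:
            self.blocked = False   # rule is marked active again
        return False


rule = PeriodicRule(lambda v: v["used"] > 0.60 * v["max"])
samples = [500, 700, 800, 400, 900]
firings = [rule.check({"used": used, "max": 1000}) for used in samples]
```

In this run the rule fires at the second sample, remains blocked at the third while the condition still holds, is re-activated at the fourth and fires again at the fifth.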

[0172] Report

[0173] The Report page for our example above, with heap usage reduced to 1% and allocation tracing time reduced to 5 seconds, is shown in FIG. 29. The Report page has several functional buttons which are self-descriptive: a Refresh button is used for updates of the page so it reflects the latest report information, a Clear button will render the report page empty, a Mail button will allow the report to be sent via e-mail and the Done button will take you back to the main page, the Rule Management page.

[0174] The sample report shown in FIG. 29 is a result of running of the rule defined and shown in FIG. 24. The report reflects all important events associated with the system having run with the rule being activated for diagnostics. The first line of the report indicates that rule JVMHeap was fired and for what system the conditions of the rule became true and when it happened. Then values of the resources on the left-hand side of the rule, which led to the rule being triggered, are shown. Under the Actions taken header the resources of the right-hand side are shown. First, the list of the classes whose objects take up most of the space on the JVM heap is requested. Filters excluding all standard classes (java.*, javax.*) are applied so that only two classes appear on the list. This is because the application run by our JVM is truly simple. The second action is a 15 second allocation trace report for objects of the classes found on the top heap objects list. Under jvm_AllocTrace you can see all allocations of objects of the two classes. Each allocation trace shows where, in what method of what class, it took place. It also shows the line number in the source code for that class, if available (such would be available when the source code was compiled without disabling the debugging information generation).

[0175] The invention can be implemented in software, hardware, or a combination of hardware and software. The invention can also be embodied as computer readable code on a computer readable medium. The computer readable medium is any data storage device that can store data which can thereafter be read by a computer system. Examples of the computer readable medium include read-only memory, random-access memory, CD-ROMs, magnetic tape, optical data storage devices, and carrier waves. The computer readable medium can also be distributed over network-coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.

[0176] The many features and advantages of the present invention are apparent from the written description, and thus, it is intended by the appended claims to cover all such features and advantages of the invention. Further, since numerous modifications and changes will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and operation as illustrated and described. Hence, all suitable modifications and equivalents may be resorted to as falling within the scope of the invention.

What is claimed is:
 1. A management system for at least one computersystem, comprising: a plurality of agents residing within managed nodesof a plurality of different products used within the computer system;and a manager for said management system, said manager being operableacross the different products.
 2. A management system as recited inclaim 1, wherein said management system further comprises a knowledgebase.
 3. A management system as recited in claim 2, wherein a pluralityof different users across multiple organizations contribute to theinformation within said knowledge base, and wherein said knowledge basedis shared and used by a plurality of different organizations to managetheir computer systems, the different organizations having the same ordifferent product configurations on their computer systems.
 4. Amanagement system as recited in claim 2, wherein said knowledge base isupdated by contributions from other users, and an update to saidknowledge base is distributed to said manager via the Internet.
 5. Amanagement system as recited in claim 2, wherein said knowledge basecontains information pertaining to the different products.
 6. Amanagement system as recited in claim 5, wherein the informationpertaining to the different products includes at least rules.
 7. Amanagement system as recited in claim 6, wherein said manager furthercomprises an inference engine that evaluate the rules.
 8. A managementsystem as recited in claim 2, wherein the information in said knowledgebase is described using thresholds on resource information.
 9. Amanagement system as recited in claim 2, wherein the information in saidknowledge base is described using thresholds on resource information andrelationships between resource information from the different products.10. A management system as recited in claim 2, wherein the informationin said knowledge base is provided thereto by a plurality of differentpersons located at different positions.
 11. A management system asrecited in claim 10, wherein the different persons provide theinformation to said knowledge base through use of graphical userinterfaces.
 12. A management system as recited in claim 2, wherein saidagents provide said manager with only symptom information that isrelevant to that identified within the knowledge base as a possibleknown causing symptom for previously observed problems.
 13. A managementsystem as recited in claim 12, wherein each of the different productsbeing managed has at least one application component, and wherein thesymptom information pertains to resource information form theapplication component of the different products.
 14. A management systemas recited in claim 1, wherein said manager determines which of at leastone of the different products a problem arises.
 15. A management systemas recited in claim 14, wherein after said manager determines which ofat least one of the different products the problem arises, initiates areporting and/or corrective action.
 16. A management system as recitedin claim 1, wherein said management system performs just-in-time or nearreal-time diagnosis of which of at least one of the different products aproblem arises or is likely to arise.
 17. A management system as recitedin claim 1, wherein said management system operates on a separatecomputer that is connected to the computer system being managed.
 18. Amanagement system as recited in claim 1, wherein said management systemis for a plurality of computer systems, and wherein said manager iscentral and operable across the different products on the differentcomputer systems.
 19. A management system as recited in claim 1, whereinsaid knowledge base is updated by contributions from other users, and anupdate to said knowledge base is distributed to said manager throughknowledge data files.
 20. A management system as recited in claim 1, wherein the managed nodes can be managed by more than one of said management systems.
 21. A method for managing an enterprise computer system, said method comprising: receiving a fact pertaining to a condition of one of a plurality of different products that are operating in the enterprise computer system; asserting the fact with respect to an inference engine, the inference engine using rules based on facts; retrieving updated facts from the inference engine from those of the rules that are dependent on the fact that has been asserted; and performing an action in view of the updated facts.
 22. A method as recited in claim 21, wherein the action is at least one of a corrective action or a debug data acquisition.
 23. A method as recited in claim 22, wherein the corrective action or the debug action further asserts at least one fact with respect to the inference engine.
 24. A method as recited in claim 21, wherein said method further comprises: making a log entry.
 25. A method as recited in claim 24, wherein said method further comprises: retrieving log data from the log entry; and generating a report from the log data.
 26. A method as recited in claim 25, wherein said method further comprises: determining a report delivery method; and delivering the report using the determined report delivery method.
 27. A method as recited in claim 21, wherein the inference engine obtains resource information from the different products to process rules based on at least one fact already asserted.
 28. A method as recited in claim 21, wherein the different products send resource information to the inference engine to process rules that depend on at least one fact representing the resource information.
 29. A method for isolating a root cause of a software problem in an enterprise computer system supporting a plurality of software products, said method comprising: forming a knowledge base from causing symptoms and experienced problems provided by a disparate group of contributors; and examining the knowledge base with respect to the software problem to isolate the cause of the software problem to one of the software products.
 30. A method as recited in claim 29, wherein the knowledge base further includes corrective actions for certain software problems.
 31. A method as recited in claim 30, wherein said method further comprises: automatically performing a correction operation to attempt correction of the software problem, the correction operation being associated with one of the corrective actions within the knowledge base that is suitable for use with the software problem.
 32. A method as recited in claim 29, wherein the causing symptoms pertain to resources of the software products.
 33. A method as recited in claim 29, wherein a repeating symptom cause can be ignored until after the cause ceases to exist at least once.